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Efficient implementations of the classical molecular dynamics (MD) method for Lennard- 
Jones particle systems are considered. Not only general algorithms but also techniques 
that are efficient for some specific CPU architectures are also explained. A simple spatial- 
decomposition-based strategy is adopted for parallelization. By utilizing the developed code, 
benchmark simulations are performed on a HITACHI SR16000/J2 system consisting of IBM 
POWER6 processors which are 4.7 GHz at the National Institute for Fusion Science (NIPS) 
and an SGI Altix ICE 8400EX system consisting of Intel Xeon processors which are 2.93 GHz 
at the Institute for Solid State Physics (ISSP), the University of Tokyo. The parallelization 
efficiency of the largest run, consisting of 4.1 billion particles with 8192 MPI processes, is 
about 73% relative to that of the smallest run with 128 MPI processes at NIFS, and it is 
about 66% relative to that of the smallest run with 4 MPI processes at ISSP. The factors 
causing the parallel overhead are investigated. It is found that fluctuations of the execution 
time of each process degrade the parallel efficiency. These fluctuations may be due to the 
interference of the operating system, which is known as OS Jitter. 

§1. Introduction 

The classical molecular dynamics (MD) simulation was first performed by Alder 
and Wainwright.'D' The original MD was used to investigate the hard-particle sys- 
tem, and an event-driven algorithm was adopted for the time evolution, which was 
soon followed by a time-step-driven algorithm.!^ Owing to the recent increase in 
computational power, MD is now a powerful tool for studying not only the equilib- 
rium properties but also the nonequilibrium transportation phenomena of molecular 
system^® 'IS} 'EJ as well as biomolecular systems .'^''^ Electronic degrees of freedom 
can also be considered by ah initio methods on the basis of quantum mechanics.'^ 
The increase in computational power also allows us not only to use more realistic 
and complex interactions but also to treat a larger number of particles. The recent 
increase in computational power has mainly been achieved by increasing the number 
of processing cores. This paradigm is called massively parallel processing (MPP), 
and most recent high-end computers have been built on the basis of this paradigm.^I^ 
The number of cores is typically from several thousand to several hundred thousand. 
In an MPP system, a single task is performed by a huge number of microproces- 
sors, which operate simultaneously and communicate with each other as needed. 
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Therefore, parallelization is now unavoidable in order to utilize the computational 
power of such machines effectively. While parallelization itself is of course impor- 
tant, the tunings for a single core have become more important for huge-scale and 
long-time simulations, since the overall performance of MD mainly depends on the 
performance on the single cores. In particular, optimizing memory access is impor- 
tant. The latency and bandwidth of memory access are sometimes slow compared 
with the execution speed of processors, and the data supply often cannot keep up 
with the requests by the processors. Therefore, data must be suitably arranged so 
that necessary data is located near a processor. Tuning for specific architectures is 
also important to achieve high performance. Since design concepts differ from one 
processor to another, it is necessary to prepare codes for each architecture, at least 
for the hot spot of the simulation where most time is spent during the execution, 
which is usually force calculation in MD. 

The Next-Generation Supercomputer project is currently being carried out by 
RIKEN.H'' The system is designed on the basis of the MPP paradigm and will use 
scalar CPUs with 128 GFlops, will consist of over 80 000 nodes, and is expected to 
achieve a total computational power of 10 petaflops. This supercomputer is now un- 
der development and will be ready in 2012. Therefore, we must prepare parallelized 
codes that can utilize the full performance of this system by then. The parallelization 
of MD has been discussed over several decades with the aims of treating larger sys- 
tems and performing simulations for longer timescales,^--- and a huge MD simulation 
involving one trillion particles has been performed on BlueGene/L, which consists 
of 212 992 processors Despite this success, it will not easy to obtain a satisfac- 
tory performance on the Next-Generation Supercomputer since the properties of the 
systems are considerably different from those of BlueGene. BlueGene has relatively 
slow processors with a 700 MHz clock speed for BlueGene/L, and consequently, it 
achieves a good balance between the processor speed and the memory bandwidth. 
In comparison, the Next-Generation Supercomputer will have a relatively high peak 
performance of 128 GFlops per CPU (16 GFlops for each core and each CPU consists 
of eight cores). Therefore, the memory bandwidth and latency may be a significant 
bottleneck. 

The purpose of the present article is to describe efficient algorithms and their 
implementation for the MD method, particularly focusing on the memory efficiency 
and the parallel overheads. We also provide some tuning techniques for a couple of 
the specific architectures. While the Lennard-Jones potential is considered through- 
out the manuscript, the techniques described here are generally applicable to other 
simulations with more complicated potentials. This manuscript is organized as fol- 
lows. In Sec. [21 algorithms for finding interacting particle pairs are described. Some 
further optimization techniques whose efficiency depends on architecture of the com- 
puter are explained in Sec. El Parallelization schemes and the results of benchmark 
simulations are described in Sec. HI The factors causing the parallel overhead for 
the developed MD code are discussed in Sec. \5\ and a summary and a discussion of 
further issues are given in Sec. El 
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Fig. 1. Truncation of the Lennard- Jones potential function, (a) Original form. The interaction 
range extends to infinity, (b) Truncation by interpolation. An interpolating function is intro- 
duced for the region from the interpolation point ri to the truncation point rc- (c) Truncation 
by adding extra terms. The additional function f{r) is chosen so that both the value and the 
derivatives of V(r) + /(r) are zero at the truncation point. 



for the particle distance r, well depth e, and atomic diameter a. In the following, 
we will use the physical quantities reduced by a, e, and k^, for example., the length 
scale is measured in the unit of a, and so forth. The first term of the right-hand side 
of Eq. (|2Tp describes the repulsive force, and the other term describes the attrac- 
tive force. This potential form has been widely used for a standard particle model 
involving phase transitions since this system exhibits three phases: solid, liquid, and 
gas. While the original form of the Lennard-Jones potential spans infinite range, it 
is wasteful to consider an interaction between two particles at a long distance since 
the potential decays rapidly as the distance increases. To reduce computation time, 
truncated potentials are usually used instead of the original potential.'^J There are 
several ways of introducing truncation. One way is to use cubic spline interpolation 
for the region from some distance to the cutoff lengthP)>[I3 as shown in Fig. [1] (b). 
In this scheme, there are two special distances, the interpolating distance n and 
the truncation distance Vc- The potential is expressed by a piecewise-defined func- 
tion that switches from V{r) given in Eq. (j2-ip to an interpolating function Vi{r) at 
r = Tj. The interpolation function is chosen so that values and the derivatives of the 
potential are continuous at the interpolation and the truncation points. These con- 
ditions require four adjustable parameters, and therefore, a third-order polynomial 
is usually chosen as the interpolating function. While this interpolation scheme was 
formerly popular, it is now outdated since it involves a conditional branch, which is 
sometimes expensive for current CPU architectures. 



§2. Pair List Construction 



2.1. Truncation of Potential Function 

The common expression for the Lennard-Jones potential is 
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Nowadays, truncation is usually introduced by adding some extra terms to the 
potential. At the truncation point, both the potential and the force should become 
continuous, and therefore, at least two additional terms are necessary. One such 
potential is given bjff^ 

Vir) = 4e 

with two additional coefficients C2 and cq, which are determined so that V{rc) = 
^'(fc) = for the cutoff length rc- Here, we use a quadratic term instead of a linear 
term such as cir + cq, since the latter involves a square-root operation in the force 
calculation, which is sometimes expensive. While the computational cost of the force 
computation is more expensive than that for the interpolating method, this method 
is usually faster than the interpolating method since conditional branches can be 
hazards in the pipeline that obstruct the instruction stream. If accuracy is less 
crucial, one can truncate the potential by adding one constant, i.e., C2 = and cq 
is determined by V{rc) = 0. In this case, the calculation of force is identical to that 
using the original potential and the modification only appears in the calculation of the 
potential energy. Note that this scheme sometimes involves some problems regarding 
the conservation of energy since the force is not continuous at the truncation point. 
The truncation changes the phase diagram. One should check the phase diagram and 
compute phase boundaries for each truncated potential before performing production 
runs. In particular, the gas-liquid coexistence phase becomes drastically narrower as 
the cutoff length decreases. To investigate phenomena involving the gas-liquid phase 
transition, the cutoff length should be made longer than 3.0a. 

2.2. Grid Search Method 

To perform the time evolution of a system, we first have to find particle pairs 
such that the distance between the particles is less than the cutoff length. Since the 
trivial computation leads to 0{N'^) computation, where is the number of particles, 
we adopt a grid algorithm, which is also called the linked-list method,'^''^ to reduce 
the complexity of the computation to 0{N). The algorithm finds interacting pairs 
with in a grid by the following three steps: (i) divide a system into small cells, 
(ii) register the indices of particles on cells to which they belong, and (iii) search 
for interacting particle pairs using the registered information. There are two types 
of grid, one is exclusive and the other is inexclusive (see Fig. [2]). An exclusive grid 
allows only one particle to occupy a cell,^ and the inexclusive grid allows more than 
one particle to occupy one cell simultaneously.'2I''ESl.[2IJ>[I3 Generally, an exclusive 
grid is efficient for a system with short-range interactions such as hard particles, and 
an inexclusive grid is suitable for a system with relatively long interactions such as 
Lennard-Jones particles. However, the efficiency strongly depends not only on the 
physical properties such as density, polydispersity, and interaction length but also on 
the hardware architecture such as the cache size and memory bandwidth. Therefore, 
it is difficult to determine which is better before implementation. We have found 
that an inexclusive grid is more efficient than an exclusive grid for the region in 
which a Lennard-Jones system involves a gas-liquid phase transition. Therefore, we 
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Fig. 2. Searching for interacting pairs in a grid, (a) Exclusive grid. The size of each cell is 
determined so that only one particle can be placed in each cell, (b) Inexclusive grid. The 
interaction length determines the size of cells. More than one particle is allowed to occupy one 
cell simultaneously. 



adopt an inexclusive grid in the following. Note that the length of each cell should be 
larger than the search length rg, which is longer than the cutoff length as described 
later, so that there are no interactions occurring through three or more cells, in other 
words, interacting particle pairs are always in the same cell or adjacent cells. 

A simple way to express the grid information is to use multidimensional arrays. 
Two types of array are necessary, one is for the number of particles in each cell, and 
the other is for the indices of the particles in each cell. Suppose the total number of 
cells is Ug = Ugx X Ugy X Ugz, theu the arrays can be expressed by 

integer: GridParticleNumber[ng2:] [n^y] [n^^] (2 3) 

integer : Gridlndex[nga:] [ugy] [n^^] [g 

maxj ) 

where gmax is the capacity of one cell. The number of particles in a cell at {x, y, z) 
is stored as GridParticleNumber[x][y][z] and the index of the «-th particle at the 
cell is sotred as Gridlndex[x][y][z][i]. Although this scheme is simple, the required 
memory can be dozens of times larger than the total number of particles in the 
system, and therefore, the indices of the particles are stored sparsely, which causes 
a decrease in cache efficiency. To improve the efficiency of memory usage, the in- 
dices of particles should be stored sequentially in an array. This method is proposed 
by Kadau et al. and implemented in SPaSM.I221i An algorithm to construct grid 
information using an array whose size is the total number of particles is as fol- 
lows. 

1: Divide a system into small cells, and assign a serial number to each cell. 
2: Label particles with the serial numbers of the cells that they belong to. 
3: Count the number of particles in each cell. 
4: Sort the particles by their labels. 
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Fig. 3. How to construct GridList, which is a hnear array of size A'^ storing grid information. 
Suppose there are four cells and 14 particles in a system (A'^ = 14). The left figure shows the 
stored data in the array and the right figure shows the configuration of the particles in the 
system. The numbers with arrows denote pointers indicating where the indices of particles 
should be stored (GridPointer) . The particles have indices from 1 to 14 and the cells are labeled 
from 1 to 4. First, prepare an array of size A'^, and label particles with the serial number of the 
cells to which they belong, (i) Count the number of particles in each cell, store the number as 
GridParticleNumber, and set a pointer for each cell at the position where the index of the first 
particle of each grid should be stored when the particles are sorted by their labels, (ii) Place 
the index of a particle at the position where the pointer corresponding to its label points, and 
move the pointer to the right, (iii) Repeat this procedure for all particles. The status of the 
GridList just after the particle number 6 is located into the GridList. (iv) The completed array 
after above (i) - (iii) procedures. The particles are sorted in GridList in order of the grid labels. 



The details of the implementation are shown in Fig. [3l Suppose that i-cell de- 
notes the cell labeled with i and the grid information is stored in the linear array 
GridList. Then GridParticleNumber [i] denotes the number of particles in i-cell and 
GridPointer [i] denotes the position of which the first particle of i-cell should be stored 
in GridList. The complexity of this sorting process is 0{N) instead of O(A^logA^) 
for usual Quicksort since it does not perform a complete sorting. The memory usage 
is effective since the indices of particles in each grid are stored contiguously in the 
array. 

2.3. Bookkeeping Method 

After constructing GridList, we have to find interacting particle pairs using the 
grid information. Suppose the particle pairs whose distance is less than the cutoff 
length are stored in PairList. PairList is an array of pairs which hold two indicies 
of interacting particles. All particles are registered in cells and labeled by the serial 
numbers of the cells. There are two cases, interacting particles share or do not 
share the same label, which correspond to inner-cell interaction and cell-crossing 
interaction. Let GridList [s..e] denote a subarray of GridList whose range is from s 
to e with the data of the pair list stored in PairList. The pseudocodes used to find 
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Algorithm 1 Finding the interacting particle pairs 



for all z-cell do 
Si ^ GridPointer[i] 
Cj Si+ GridParticleNumber[i] -1 
Lc GridList[sj..ej] 

for all j such that j-cell is the neighbor of i-cell do 

Sj -It- GridPointer[j] 

ej Sj+ GridParticIeNumber[j] -1 

Append GridList[sj..ej] to Lc 
end for 

for k = 1 to GridParticleNumber[i] do 
for / = A; + 1 to \Lc\ do 
m ^ Lc[k] 
n ^ Lc[l] 

if m- and n-particles interact each other then 

Register a pair (m, n) to PairList 
end if 
end for 
end for 
end for 



interacting pairs are shown in Algorithm [TJ Only 13 of the 26 neighbors should be 
checked for each cell because of Newton's third law. 

As the number of particles increases, the cost of constructing the pair list in- 
creases drastically since it involves frequent access to memory. While it depends 
on the details of the system, constructing a pair list is sometimes dozens of times 
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more expensive than the force calculation. In order to reduce the cost of construct- 
ing a pair list, a well-known bookkeeping method was proposedPS The main idea 
of the bookkeeping method is to reuse the same pair list for several time steps by 
registering pairs within a search length rg which is set at longer than the cutoff (in- 
teraction) length Tc (see Fig. U]). The margin rg — Tc determines the lifetime of a 
pair list. While a longer margin gives a longer lifetime, the length of the pair list 
also increases, and consequently, the computational time for the force calculation 
becomes longer. Therefore, there is an optimal length for the margin that depends 
on the details of the system, such as its density and temperature. Several steps after 
the construction of the pair list, the list can become invalid, i.e., some particle pair 
that is not registered in the list may be at a distance shorter than the interaction 
length Tc. Therefore, we have to calculate the expiration time of the list and can 
check the validity of the pair list for each step as the following simple method. The 
expiration time tf. of a pair list constructed at time t is given by 

te = + t, (2-4) 

where Umax is the absolute value of the velocity of the fastest particle in the system. 
The factor 2 in the denominator corresponds to the condition that two particles 
undergo a head-on collision. It is necessary to update the expiration time when 
the maximum velocity changes to 

t,^{te-t)^ + t, (2-5) 

^new 

where and v^ew denote the previous and current maximum speeds, respectively. 
Obviously, the expiration time will be brought forward in the case of a faster particle, 
and vice versa. This technique for calculating the expiration time is called the 
dynamical upper time cutoff (DUTC) method and was proposed by Isobe.'^ 

Although the validity check by considering the maximum velocity is simple, the 
expiration time te determined by the velocities of the particles is shorter than the 
actual expiration time of the pair list, since the pair list is valid as long as the 
particles migrate short distances. Therefore, we propose a method which extends 
the lifetime of a pair list by utilizing the displacements of the particles. The strict 
condition for the expiration of a pair list is that there exists a particle pair {i,j) such 
that 

Iq"'*" - qf I > rs and |q, - q,.| < r,, (2-6) 

where q?''^ is the position of the i-particle when the pair list was constructed and q^ 
is the current position of the z-particle. The lifetime of a pair list can be extended 
if we use the strict condition for the validity check, but the computational cost of 
checking this the condition is 0{N'^). We therefore extend the lifetime of the pair 
list by using a less strict condition. The main idea is to prepare a list of particles 
that have moved by large amounts and check the distance between all pairs in the 
list. The algorithm is as follows: (i) Calculate the displacements of the particles from 
their position when the pair list was constructed and find the largest displacement 
^^'max- (ii) If there exists a particle such that Z\rj > rg — Tc — Z\rmax) then append 
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Ts - Tc Ld 




Fig. 5. The large displacement check (LDC) method, (i) Calculate the displacements Ari for the 
particle i and find the largest displacement Zimax- (ii) Register a index of a particle i such that 
Ari > Ts — Tc — /i)"max to a list Lo whose size is denoted by |-Ld|. (iii) Old and current position of 
particles i and j. The dashed circle denote the old position which was registered in the previous 
consturction of the pair list. The solid circles denote the current positions of particles. The 
expiration of the pair list is determined by the old and the current distance of the particles. 



the particle to the hst of large displacement Ld which size is denoted by \Lj)\. If 
the list is longer than the predetermined maximum length of the list A'^i, that is, 
|Ld[ > A'^i ,then the pair list is invalidated, (iii) If the old distance is longer than the 
search length and the current distance is shoter than the cutoff length, it means that 
there are interacting particles which are not registered in a pair list. Therefore, the 
pair list is invalidated if there exists a pair (i, j) in the list of large displacement such 
that |q°^'^ — q°'^| > and |qj — q^l < re- We call this method the large displacement 
check (LDC) method. See Fig. O for a schematic description of the LDC method. 

Although the lifetime of the pair list increases with Ni, the computational cost 
also increases as 0{Nf). Therefore, there is an optimal value for Ni. Since the 
LDC method is more expensive than DUTC method, we adopt a hybrid algorithm 
involving both methods, that is, first we use the DUTC method for the validity check, 
and then we use LDC method after the DUTC method decides that the pairlist is 
expired. The full procedure of the validity check is shown in Algorithm [2l 

2.4. Sorting of pair list 

After constructing a pair list, we can calculate the forces between interacting 
pairs of particles. Hereafter, we call the particle of index i the i-particle for conve- 
nience. Suppose the positions of particles are stored in q[A^]. A simple method for 
calculating the force using the pair list is shown in Algorithm [3l This simple algo- 
rithm is, however, usually inefficient since it involves random access to the memory. 
Additionally, it fetches and stores the data of both particles of each pair, which is 
wasteful. Therefore, we construct a sorted list to improve the efficiency of memory 
usage by calculating particles together that interact with the same particles. For 
convenience, we call the lower-numbered particle in a pair the key particle and the 
other the partner particle. The partner particles with the same key particle are 
grouped together. The detail of the implementation is shown in Fig. [6l Suppose the 
data of the sorted list and the position of the first partner particle of the i-particle in 
the list are stored in SortedList and KeyPointer[i], respectively. Then the algorithm 
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Algorithm 2 Checking the vahdity of a pair hst 

1. When a pair list is constructed 

1: for all i-particle do 

2: q^^'^ ^ Qj {Keep the positions} 

3: end for 

4: buffer Jength Ts — rc 

2. Validity check by DUTC method 

1: '-'max ^ thc maxiniTim velocity of the particle 

2: bufferJength ^ bufferJength— 2i;inax^i 

3: if bufferJength > then 

4: the pair list is valid. 

5: else 

6: Check the validity by considering displacements. 

7: end if 

3. Validity check by LDC method 

1: for all i-particle do 

2: z\ri ^ |qf - qj 

3: end for 

4: /Armax ^ max{Z\ri} 

5: Prepare a list Ld 

6: for all i-particle do 

7: if Avi > Vg-Vc- Avraax then 

8: Append i-particle to Ld 

9: if |Ld| > Ni then 

10: the pair list is invalidated. 

11: end if 

12: end if 

13: end for 

14: for all (i,j) in Ld do 

15: if |q°^^ — q^^*^! > Vg and |qj — qj| < Vc then 

16: the pair list is invalidated. 

17: end if 

18: end for 

19: the pair list is invalidated. 



Algorithm 3 Calculating the force in a simple manner 
1: for all pairs in PairList do 
2: if |q[i] - qL?]! < Tc then 
3: Calculate force between i- and j-particles. 
4: end if 
5: end for 



Efficient Implementations of MD Simulations for LJ System 



11 



Key 
(KeyPointer) 

Partner 
(SortedList) 



1 2 3 4 5 6 7 



1 


3 


4 


8 


10 


11 


12 




4 


5 


8 


4 


5 


7 


8 


8 


9 


6 


7 


8 



Fig. 6. A sorted list. A list sorted by the indices of key particles is constructed by the following 
procedure, (i) Count the number of partner particles for each key particle for the particle pairs 
are stored in the pair list, (ii) Prepare a linear array of size A'^pair, and set a pointer for each 
key particle where the index of the first partner particle will be stored, (iii) Similarly to in the 
grid construction (Fig. [S]), place the index of a particle at the pointer of its key particle and 
move the position of the pointer to the next position, (iv) Repeat this procedure for all pairs to 
construct the sorted list. The figure shows two lists, KeyPointer and SortedList. KeyPointer[i] 
stores the first position of the interacting particles in SortedList. 

used to calculate the force using the sorted pair list is shown in Algorithm |H The 



Algorithm 4 Calculating the force using the sorted pair list 
1: for i = 1 to TV - 1 do 

2: Qkey ^ q[«] 

3: fkey ^ 

4: for k = KeyPointer[i] to KeyPointer[z + 1] -1 do 

5: j ^ SortedList [/c] 

6: ^ Iqicey -qbil 

7: / ^ -V'{r) 

8: Update momenta of the j-particle using /. 

9: Update fkey using /. 

10: end for 

11: Update momenta of i-particle using fkey 

12: end for 



variables q^-gy and fkey are intended to be stored in CPU registers. The amount 
of memory access decreases compared with that for Algorithm [3l since the fetching 
and storing are performed only for the data of partner particles in the inner loop. 
The amount of memory access becomes half when the key particles has an enough 
number of partner particles. The force acting on the key particle is accumulated 
during the inner loop, and the momentum of the key particle is updated at the end 
of the inner loop. Note that the loop variable i takes values from 1 to — 1, instead 
of 1 to A^, since this is the loop for the key particle, and the A^-particle cannot be a 
key particle. The value of KeyPointer [A^] should be |PairList| + 1, where |PairList| 
is the number of pairs stored in PairList. 

§3. Further Optimization 

3.1. Elimination of Conditional Branch 

Since we adopt the bookkeeping method, a pair list contains particle pairs whose 
distances are longer than the interaction length. Therefore, it is necessary to check 
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whether or not the distance between each pair is less than the interaction length 
before the force calculation (Algorithm [3]) . These conditional branches (so-called, 
"if then else" of "case" in the high-level language statement) can sometimes be ex- 
pensive for RISC (Reduced Instruction Set Computer)-based processors such as IBM 
P0WER6, and it causes a significant decrease in execution speed. To prevent the de- 
crease in speed due to conditional branches, we calculate the forces without checking 
the distance and assign zero to the calculated value for the particle pairs whose dis- 
tances are longer than the interaction length. This method is shown in Algorithm [5j 
While it appears to be wasteful, it can be faster than the original algorithm when 

Algorithm 5 Calculating the force with if-branch elimination 

1: for k = Ito Afpair do 

2: PairList[/c] 

3: r- ^ |q[i] - q[j]| 

4: / ^ -V'{r) 

5: if r > rc then 

6: /^O 

7: end if 

8: Update momenta of i- and j-particles using /. 

9: end for 



the penalty due to the conditional branches is more expensive than the additional 
cost of calculating forces. Additionally, this loop can be performed for all pairs, 
which makes it easy for a compiler to optimize codes, such as that for prefetching 
data. Note that, some CPU architectures prepare the instruction for the conditional 
substitution. For example, PowerPC architecture has fsel (Floating-Point Select) 
instruction which syntax is fsel FRT, FRA, FTC, FRB where FRT, FRA, FTC, and 
FRB are registers. If FRA > then FRT = FRB, otherwise FRT = FRC. We found that 
this optimization provides us with more than double the execution speed on IBM 
P0WER6 in the benchmark simulation whose details are described in Sec. 14.51 

3.2. Reduction of Divisions 

The explicit expression of Algorithm [3] is shown in Algorithm [6l where C2 is the 
coefficient for the truncation defined in Eq. (j2-2p and dt is the time step. For simplic- 
ity, the diamiter of the particle a is set to unity and only the force calculations are 
shown. As shown in Algorithm[6l the force calculation of the Lennard- Jones potential 
involves at least one division operation for each pair. Divisions are sometimes ex- 
pensive compared with multiplications. Therefore, computations can be made faster 
by reducing the number of divisions. Suppose there are two independent divisions 
as follows; 

A2 ^ I/B2. 

We can reduce the number of divisions by transforming them into 
C ^ l/{Bi X B2), 
Ai^Cx B2, 
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Algorithm 6 Explicit expression of force calculation 
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10 


end for 



A2^C X Bi. 

Here, the number of divisions decreases from two to one, while three multiplications 
appear. This optimization can be effective for architecture in which the penalty for 
divisions is large. We apply this optimization technique to the force calculation of 
the Lennard- Jones system by utilizing loop unrolling, that is, we first obtain two 
independent divisions by unrolling and then we apply this method of reducing the 
number of divisions to the unwound loop. Although one can apply this optimization 
technique to Algorithm [6l it is more efficient to use it together with the sorted array 
described in Sec. 12.41 The algorithm for the force calculation with a reduced number 
of divisions is shown in Algorithm [71 The term [xj denotes the largest integer less 
than or equal to x and subscriptions a and b correspond to the loop unrolled twice. 
This optimization achieves 10% speedup in the condition described in Sec. 14.51 

3.3. Cache Efficiency 

As particles move, the indices of the partner particles of each key-particle become 
random. Then the data of the interacting particles, which are spatially close, may 
be widely separated in memory. This severely decreases the computational efficiency 
since the data of particles that are not stored in the cache are frequently required. 
To improve the situation, reconstruction of the order of the particle indices in arrays 
is proposed.'^ This involves sorting so that the indices of interacting particles are 
as close together as possible. This can be implemented by using the grid informa- 
tion described in Sec. 12.21 After an array of grid information is constructed, the 
particles are arranged in the order in the array. The sorting algorithm is shown in 
Fig. [71 After sorting, the indices of the particles in the same cell become sequential, 
which improves the cache efficiency. Note that the pair list is made invalid by this 
procedure; therefore, the pair list construction and the sorting should be performed 
at the same time. 

A comparison between the computational speed with and without sorting is 
shown in Fig. [SI The simulations were performed on a Xeon 2.93 GHz machine 
with 256 KB for an L2 cache and 8 MB for an L3 cache. We performed the sort- 
ing every ten constructions of a pair list, which is typically every few hundred time 
steps, since the computational cost of sorting is not expensive but not negligible. To 
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Algorithm 7 Calculating the force using the sorted array and reduction of divisions 

1: for i = 1 to iV - 1 do 

2: Qkey ^ 

3: fkey ^ 

4: n i— Key Pointer [i + 1] - Key Pointer [i] 

5: for A; = 1 to [n/2\ do 

6: ka ^ KeyPointcr[i] + 2{k - 1) 

7: ki, ^ KcyPointcr[i] + 2{k - I) + 1 

8: ja ^ SortedList[A;a] 

9: jb SortedList [/cfe] 

10: r„ ^ q[j 

11: Vb ^ q[jb] - Qkey 

12: rl ^ |r„|2 

13: rl |r(,|2 

14: ^ r2 X X 

15: X X 

16: ^r^xr^xrl 

17: r^^ X X 

18: D ^ l/(ri4 X ri4) 

19: fdta ^ [(24r^ - 48) X D X + 8C2J x 

20: fdtb ^ [i24rl -48)xDx r^^ + 802] x dt 

21: fkey ^ fkey + fdta X r^ + /di;, X r^ 

22: p[j„] ^ p[j„] - fdta X r^ 

23: pL^b] ^ Pbb] - /citb X Yb 

24: end for 

25: if n is odd then 

26: calculate force for the last partner particle. 

27: end if 

28: p[i] ^ p[i] + fkey 

29: end for 



mimic a configuration after a very long time evolution without sorting, the indices 
of the particles were completely shufHed at the beginning of the simulation, that is, 
the labels of the particles were exchanged randomly while keeping their positions 
and momenta. The computational speed is measured in the unit of MUPS (millions 
update per second) , which is unity when one million particles are updated in one sec- 
ond. As the number of particles increased, the computational efficiency was greatly 
improved by sorting. The stepwise behavior of the data without sorting reflected 
the cache hierarchy. Since the system is three dimensional, amount of memory re- 
quired to store information of one particle is 48 bytes. Therefore, L2 cache (256) 
KB corresponds to 256 KB/48 bytes ~ 5.3 x 10^ and L3 cache (8MB) corresponds 8 
MB/48 bytes 1.7 x 10^. When the number of particles was smaller than 5.3 x 10^, 
the difference in speed between the data with and without sorting was moderate. 
While the number of particles was larger than 5.3 x 10^, we found that the efficiency 
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(a) Before Sorting (b) After Sorting 




1 I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 |l0|l1 |l2|l3|l4| 



Fig. 7. Sorting the indices of the particles to increase cache efficiency. There are four ceUs and 14 
particles in the system, (a) Status of the system before sorting. The array of grid information 
corresponding to this configuration is shown below (see Fig. [3} . (b) Indices of the particles are 
sorted using the grid information. 



3.5 




10^ 10^ 10'* 10^ 10^ 

Particles 

Fig. 8. Computational speed with and without sorting. The speed is greatly improved by sorting 
for a large number of particles A'^. Dashed vertical lines denote A'^ — 5.3 x 10'' and ~ 1.7 x 10^ . 
The behavior changes at the lines reflecting the cache hierarchy. 



was improved by 30 to 40% by sorting. When the number of particles was larger 
than 1.7 X 10^, the computational speed without sorting became much lower than 
that with sorting. We found that the effect of the sorting in efficiency was about a 
factor of 3 in the case of 10^ particles. 

3.4. Software Pipelining 

The elimination of the conditional branch can be inefficient in some types of 
CPU architectures such as Intel or AMD, since the cost of the conditional branch is 
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Pipeline stages 
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».|2 
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r < r, 



Pp 



Pp - dfr 



NOP 



(a) Usual Calculation 



Descriptions 

calculate distance between two particles 
calculate the square of the distance 
compare the distance and the cutoff 
calculate the force using the distance 
update the momentum of the particle 

no operation (distance > cutoff) 
(b) Software pipelined 
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Fig. 9. Software pipelining. Calculation of the force consists of five parts: DIS, SQR, CMP, 
CFD, and UMP. (a) Implementation without software pipelining. All calculation are performed 
sequentially, (b) Implementation with software pipelining. The order of the instructions are 
arranged in order to improve the throughput of the instructions. 



not expensive. In such CPU architectures, software pipeUning is effective in improv- 
ing the computational speed. Recent CPUs can handle four or more floating-point 
operations simultaneously if they are independent. The force calculation of one pair, 
however, consists of a sequence of instructions that are dependent on each other. 
Therefore, CPU sometimes has to wait for the previous instruction to be completed, 
which decreases the computational speed. Software pipelining is a technique to re- 
duce such dependencies between instructions in order to increase the efficiency of 
the calculation. The purpose of the software pipening is increasing a number of 
instructions which can be executed simultaneously, and consequently, increasing the 
instruction throughputs in a hardware pipepine. While the software-pipelining tech- 
nique is often used in combination with loop unrolling, the simple loop unrolling can 
cause a shortage of registers leading to access to the memory which can be expensive. 
The main idea of software pipelining is to calculate independent parts of the force 
calculation simultaneously by shifting the order of the instructions. The calculation 
of force can be broken into the following five stages; 
DIS calculate distance between two particles. 
SQR calculate the square of the distance. 
CMP compare the distance and the cutoff length. 
CFD calculate the force using the distance. 
UMP update the momentum of the particle. 

The schematic image of the software pipelining is shown in Fig. [9l The number 
after a stage denotes the index of the pair (the value of the loop counter), e. g., 
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Algorithm 8 Calculating the force with software pipelining 

1: for i = 1 to - 1 do 

2: Qkey ^ 

3: Pkey ^ 

4: fdt ^ 

5: k KeyPointer[i] 

6: ja SortedList[fe] 

7: Ta ^ q[ja] - Qkey 

8: n^o 

9: jb ^ 

10: for k = KeyPointer[z] to KeyPointer[i + 1] -1 do 

11: r^ra 

12: ^ |rp 

13: i ^ ja 

14: ^ SortedList[k+l] 

15: Ta ^ q[ja] - Qkey 

16: if < then 

17: Pkey ^ Pkey + fdt X 

18: p[jb] ^ p[jb] - fdt X Tb 

19: /dt ^ [(24r'^ - 48) /r^'' + 8C2] x 

20: jb ^ j 

21: Tfc ^ r 

22: end if 

23: end for 

24: p[jb] ^ p\jb] - fdt X Tb 

25: p[i] ^ pW + Pkey + fdt X Tb 

26: end for 



DISl denotes the calculation of the distance of the first particle pair. Figure (a) 
denotes the calculation without software pipelining. Suppose the distances between 
third and fifth pairs are longer than the cutoff length. Then the calculation of the 
force and the update the momentum are unnecessary. We refer this to NOP (no 
operation). All calculations are performed sequentially. For instance, SQRl is per- 
formed after DISl is completed, CMPl is performed after SQRl is completed, and 
so on. Figure [9] (a) denotes the calculation with software pipelining. The order 
of the calculations are shifted so that the number of independent calculations are 
increased. For instance, DIS4, SQR3, and UMPl can be performed simultaneously 
since there are no dependencies between them (see the column where the value of the 
loop counter is three). Increase of independent instructions increases the efficiency 
of the hardware pipeline, and consequently, impoves the computational speed The 
pseudo code corresponding to Fig. [9] is shown in Algorithm [HI As far as we know in 
our experience, the software pipelining without loop unrolling described here is the 
best choice for recent architecture such as Intel Xeon processors. 
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Total System 
(Parallel Environment) 

Node 

Node 

Node 

Node 

Fig. 10. Parallel environment. A total system of a supercomputer consists of several nodes. A 
node consists of several CPUs. A CPU consists of several cores. One or two processes are 
executed on a core. 



§4. Parallelization 

4.1. Parallelization Scheme 

Since the size of the system or computational steps are hmited on the single 
process job, parallel computation is required in order to simulate large systems or 
long time steps. Parallel computation is executed on a parallel environment such as 
a supercomputer. A total system of a supercomputer has hierarchic structure, i.e., 
a total system consists of nodes, a node consists of CPUs, a CPU consists of cores 
(see Fig. [T0|) . One or two processes is usually executed on a single core, and a set of 
processes work concertedly on the same task. 

There has been considerable effort devoted to the parallelization of MD simula- 
tions. The parallel algorithms used for MD can be classified into three strategies, do- 
main decomposition, particle decomposition, and force decomposition.^ In domain 
decomposition (also called spatial decomposition) the simulation box is subdivided 
into small domains, and each domain is assigned to each process. In particle de- 
composition, the computational workload is divided and distributed on the basis of 
the particles. Different processes have different particles. In force decomposition the 
computational workload is divided and distributed on the basis of the force calcula- 
tion. Consider a matrix Fij that denotes the force between i- and j'-particles. Force 
decomposition is based on the block decomposition of this force matrix. Each block 
of the force matrix is assigned to each process. Generally, domain decomposition 
is suitable for systems with short-range interactions and simulations with a large 
number of particles ,123 '123 and force decomposition is suitable for systems with long- 
range interactions such as electrostatic force and simulations with a large number of 
steps. In the present paper, we adopt a simple domain decomposition method for 
parallelization, i.e., we divide a system into small cubes of identical size. We use a 
Message Passing Interface (MPI) for all communication. To perform the computa- 
tion, three types of communication are necessary: (i) exchanging particles that stray 
from the originally associated process, (ii) exchanging information of particles near 
boundaries to consider the cross-border interactions, and (iii) synchronization of the 
validity check of pair lists. 
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(a) x-direction (b) y-direction (c) z-direction (d) transfer of particles 




Fig. 11. Sending and receiving of positions of particles that are close to boundaries, (a) Sending 
along the a;-direction. First, each process sends/receives the positions of particles near the right 
boundary to/from the right neighbor. Similarly, the positions of particles near the left boundary 
is transmitted, (b) Sending along the y-direction. Positions of particles near the boundaries 
is sent to neighbors, (c) Sending along the z-direction is performed similarly, (d) Transfering 
particles. Suppose there are three neighboring domains associated with Process A, B, and C. 
First Process A sends a particle to Process B along x-axis. Process B sends the particle recieved 
from Process A to Process C. Then Process C recieves particles from Process A via Process 
B. Thus, each process obtains the information of particles from its diagonal neighbors without 
direct communication. 

4.2. Exchanging Particles 

Although all particles are initially in the domains to which they are placed, they 
tend to stray from their original domains as a simulation progresses. Therefore, 
each process should pass the migrated particles to an appropriate process at an 
appropriate time. There is no need for exchanging particles each step since it will 
not cause problems while migration distances are short. Therefore, migrated particles 
are exchanged simultaneously with a reconstruction of pair list. 

4.3. Cross-Border Interactions 

To calculate forces between particle pairs which are assigned to different pro- 
cesses, communication between neighboring processes is necessary. While a naive 
implementation involves communication in 26 directions since each process has 26 
neighbors, the number of communications can be reduced to six by forwarding in- 
formation of particles sent by other processes. Suppose a process is assigned the 
cuboidal domain Sx < x < Cx, Sy < y < Cy, and Sz < z < Cz, and the search length 
is denoted by rg, which is defined in Sec. 12. 3[ The procedures to send and receive 
positions of particles on the boundaries are as follows. 
1: Left: Send positions of particles in the region s^; < x < s^^ -|- rg to the left process, 

and receive positions of particles in the region Sx — rg < x < Sx from the left 

process. 

2: Right: Send the positions of particles in the region ex — rs < x < Cx, and receive 
positions of particles in the region Cx ^ x ^ Cx ~t~ from the right process. 

3: Backward: Send positions of particles in the region Sy < y < Sy + r^ to the 
backward process, including the particles received from the left and the right 
process. Receive positions of particles in the region Sy — rg < y < Sy from the 
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backward process. 

4: Forward: Send the positions of particles in the region Cy — < y < Cy to the 
process in front, including the particles received from left and the right processes. 
Receive positions of particles in the region Cy < y < Cy + Vs from the process in 
front. 

5: Down: Send the positions of particles in the region Sz < z < Sz + Vg to the lower 
process, including the particles received from other processes. Receive positions 
of particles in the region Sz — Ts < z < Sz from the lower process. 
6: Up: Send positions of particles in the region — rg < z < from the upper 
process, including the particles received from other processes. Receive positions 
of particles in the region < z < + rg from the upper process. 
After the above six communications, exchanging positions of particles close to bound- 
aries are completed including between diagonal processes. The details are illustrated 
in Fig. [TTJ Note that the simple implementation of the communication may cause 
deadlock which is a situation that two or more processes will wait responce forever. 
Suppose there are two processes, A and B. If A tries to send a message to B, and B 
also tries to send to A with MPI blocking communications, then the communication 
will not be completed. Similarly, suppose there are three processes. A, B, and C. If 
the communication path are A ^ B, B — )• C, and C — )■ A, then the system is getting 
into a deadlock. In the domain decomposition parallelism with periodic boundary 
condition, it is possible that a deadlock occurs. There are several techniques to 
avoid this problem, such as reversing the order of sending and receiving and using 
nonb locking operations. A simple way to avoid such a deadlock problem is to use 
the MPI_Sendrecv function. This function does not cause deadlock for closed-path 
communications. The pseudocode using MPLSendrecv is shown in Algorithm [9l 
SendDir and RecvDir denote the direction that the process should send to or re- 
cieve from. There are six directions to complete the communication. The number 
of particles to be sent in each direction is fixed until the expiration of a pair list. 
Suppose Ns is a number of particles to send. Then the number of data sent is SNg, 
since only coordinates are required to calculate forces. The message tag denoted by 
Tag, which is an arbitrary non-negative integer but it should be the same for sender 
and receiver. 

4.4. Synchronization of pair list Constructions 

Each process maintains its own pair list and has to reconstruct it when it expires. 
If each process checks the validity of the pair list and reconstructs it independently, 
since it causes idle time, and the calculation effiency dicreases. To solve this problem, 
we synchronize the validity check of pair lists for all processes, that is, we reconstruct 
all the pair lists even when one of the pair lists has expired. While this reduces 
the average lifetime of the pair lists, the overall speed of the simulation is greatly 
improved since the idle time is eliminated (see Fig. I12p. This synchronization process 
can be implemented with MPI_Allreduce, which is a global-reduction function of the 
MPI. Note that the global synchronization of pair lists can have a serious effect on the 
parallel efficiency when a system is highly inhomogeneous such as a system involving 
phase transitions or a heat-conducting system. In such cases, further advanced 
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Algorithm 9 Communications for cross border interactions 

1: for SendDir in [left, right, backward, forward, down, up] do 
2: RecvDir ■h- the opposite direction of SendDir 
3: SendBuf ^ coordinates of particles to send 
4: A^s number of particles to send 
5: Nr number of particles to receive 
6: Prepare RecvBuf 

7: DestRank ^ rank of neighbor on the SendDir. 
8: SrcRank rank of neighbor on the RecvDir. 

9: Call MPl_SendTecY{SendBuf, 3A^s, MPI_DOUBLE, DestRank, Tag, 

RecvBuf, 3Nr, MPLDOUBLE, SrcRank, Tag, MPLCOMM.WORLD) 
10: Update coordinates of particles using RecvBuf 

11: end for 




: ■ ' ■ ■ '■■'■■'■■'■\ Force Calculation 
KSSSSSSS Pair List Construction 
^^^^^^^^ Idle Time 



Fig. 12. Synchronization of pair lists. The numbers in the boxes of force calculation denote the 
time steps. The size of the boxes of force calculation and the pair list construction reflect the 
computational time required for them, i.e., the size of the box is proportional to the computa- 
tional time. Suppose the lifetime of the pair list of Process 1 is 3 steps, that is, the pair list 
should be reconstructed after three force calculations. Similarity, the lifetime of the pairlists are 
4 and 5 for Process 2 and 3, respectively, (a) Without synchronization, if the processes recon- 
struct pair lists independently, then other processes must wait for the reconstruction resulting 
in a huge amount of idle time, (b) With synchronization, if the processes synchronize their own 
pair lists with others, then the idle time vanishes while the average lifetime of pair lists also 
decreases. 



treatment that manages pair lists locally at each location of grid would be required. 

4.5. Results of Benchmark Simulations 

To measure the efficiency of the developed code, we perform benchmark simu- 
lations on two different parallel computers located at the different facilities, which 
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Facility 


Name 


CPU 


Cores Processes 


Memory 


NIFS 


HITACHI SR16000/J2 


IBM P0WER6 (4.7 GHz) 


32 64 (*) 


128GB 


ISSP 


SGI Altix ICE 8400EX 


Intel Xeon X5570 (2.93 GHz) 


8 4 or 8 (**) 


24GB 



Table I. Summary for HITACHI SR16000/J2 at the National Institute for Fusion Science (NIFS) 
and SGI Altix ICE 8400EX at the Institute for Solid State Physics (ISSP). Location, name, 
CPU, number of cores per node, number of processes per node, and memory per node are 
shown. (*) Using simultaneous multithreading (SMT), 64 processes are executed on each node. 
(**) The single-node run is performed with four processes on one CPU and two-or-more-node 
runs are performed with 8 processes per node on two CPUs because of the queue design at ISSP. 



are NIFS and ISSP. The details of the computers are listed in Table HI In the actual 
analysis we used the so-called week scaling analysis, which is the dependence of the 
execution time in terms of node number under the condition of the number of par- 
ticle per process keeping constant. The conditions for the benchmark simulations 
are as follows. Note that, all quantities are normarized by well depth e and atomic 
diameter a of the Lennards-Jones potential defined in Sec. 12.11 

• System size: 100 x 100 x 100 for each process. 

• Number of particles: 500,000 for each process (number density is 0.5). 

• Initial condition: Face-centered-cubic lattice. 

• Boundary condition: Periodic for all axes. 

• Integration scheme: Second-order symplectic integration. 

• Time step: 0.001. 

• Cutoff length: 2.5. 

• Search length: 2.8. 

• Cutoff scheme: Add constant and quadratic terms to potential as shown in 
Eq. 

• Initial velocity: The absolute value of the velocities of all particles are set to be 
0.9 and the directions of the velocities are given randomly. 

• After 150 steps, we measure the calculation (elapsed) time for the next 1000 
steps. 

When a system with N particles is simulated for k steps in t seconds, then the 
number of MUPS is given by xNk/lO^t. At NIFS, we placed 4x4x4 cubes at each 
node, and assigned one cube to each process (see Fig. [T3]) . since each node has 32 
cores, and we placed 64 processes on each node utilizing simultaneous multithreading 
(SMT). At ISSP, we placed 2x2x2 cubes at each node, and assigned one cube to 
each process, since one node consists of two CPUs (8 cores) at ISSP. 

The results obtained at NIFS and ISSP are summarized in Table Hll and Table Hill 
respectively. Both results are shown in Fig. [TH While the elapsed time of the 2-node 
calculation at NIFS is 302.6 s, that of the 128-node calculation is 412.7 s. Similarly, 
while the elapsed time of the single node calculation at ISSP is 183.13 s, that of the 
1024-node calculation is 272.478 s. Since we perform weak scaling, the elapsed time 
should be independent of the number of processes provided the computations are 
perfectly parallelized. Therefore, the increase in elapsed time of the calculation with 
larger nodes is regarded as being due to the parallel overhead. 
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1 Process 1 Node Total System 




0.5 Million Particles 64 Processes 8 Nodes 



Fig. 13. Domain decomposition for parallelization at NIFS. Each process is assigned a cube with 
a length of 100 that contains 500,000 particles. Each node is assigned to a large cube made up 
of 4 X 4 X 4 small cubes. The case of eight nodes is shown as an example. 



Nodes 


Processes 


Particles 


Elapsed Time [s] 


Speed [MUPS] 


Efficiency 


2 


128 


64,000,000 


302.623 


211.484 


1.00 


4 


256 


128,000,000 


305.469 


419.027 


0.99 


8 


512 


256,000,000 


309.688 


826.638 


0.98 


16 


1024 


512,000,000 


325.744 


1571.79 


0.93 


32 


2048 


1,024,000,000 


333.140 


3073.78 


0.91 


64 


4096 


2,048,000,000 


385.765 


5308.93 


0.78 


128 


8192 


4,096,000,000 


412.717 


9924.48 


0.73 



Table II. Results of benchmark simulations at NIFS. Numbers of nodes, numbers of processes, 
numbers of particles, elapsed times [s], and speeds [MUPS] are listed. Efficiency is estimated 
based on the elapsed time obtaind by the case of the 2-node calculation, since a single-node 
calculation is not permitted at NIFS. 



Nodes 


Processes 


Particles 


Elapsed Time [s] 


Speed [MUPS] Efficiency 


1 


2000000 


183.13 


10.9212 


1.00 


2 


8000000 


186.296 


42.9424 


0.98 


4 


16000000 


187.751 


85.2194 


0.98 


8 


32000000 


190.325 


168.133 


0.96 


16 


64000000 


194.932 


328.32 


0.94 


32 


128000000 


203.838 


627.95 


0.90 


64 


256000000 


224.028 


1142.71 


0.82 


128 


512000000 


228.485 


2240.84 


0.80 


1024 


4096000000 


272.478 


15032.4 


0.66 



Table III. Same as Table [III for ISSP. Efficiency is determined relative to the elapsed time for the 
single-node calculation. 

§5. Parallel Overhead 

The benchmark results described in the previous section show that the paral- 
lelization efficiency decreases as a number of processes increases, while the ammount 
of communication in one step seems to be negligible compared with the computa- 
tions. In the present section, we investigate the cause of the parallel overhead of our 
codes in order to identify the factor of the decrease in parallelization efficiency. All 
simulations presented here are performed under the same condition as that used for 
the benchmarks in Sec. 14.51 
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Fig. 14. Results of benchmark simulations. The 
ISSP by filled symbols. ET and Eff denotes 
increase in elapsed time is due to the parallel 



results at NIFS are denoted by open symbols and 
elapsed time [s] and efficiency, respectively. The 
overhead. 



5.1. Granularity 

In parallel computing, the ratio of computation to communication is important. 
This ratio is called the granularity. Fine-grain parallelism refers to the situation that 
data are transferred frequently and computational tasks are relatively light compared 
with communication. Coarse-grained parallelism, on the other hand, refers to the 
situation that computational tasks are relatively heavy compared with communica- 
tion. In general, a program having coarse granularity exhibits better performance 
for a large number of processes. Here, we estimate the granularity of our code. 
Although the interaction length is set to 2.5, each process has to send and receive 
information of particles whose distances from boundaries are less than 2.8, since 
we adopt the DUTC method and the search length is set to 2.8. Each process is 
assigned to a cube of length 100 and the number density is set to 0.5. The commu- 
nications are performed along the x-, y-, and z-directions, and two communications 
are involved for each direction. The amount of transmitting data increases in order 
of the X-, y-, and z- directions, since the received data of particles are forwarded. 
The position of each particle is expressed by three double-precision floating-point 
numbers, requiring 3 x 8 = 24 bytes. The amount of communication along y-axis is 
larger than that along x-axis, since the information of particles recieved along x-axis 
are transfered to y-axis (see Fig. [TTj) . The number of particles transmitted along 
the X-axis is 100 x 100 x 2.8 x 0.5 = 14, 000. Therefore, 336 KB are transmitted 
along the x-axis. Similarly, (100 + 2.8 x 2) x 100 x 2.8 x 0.5 x 24 = 355 KB of 
data are transmitted along the y-axis and 375 KB of data are transmitted along the 
z-axis. Since inner-node communication is performed on shared memory and should 
be sufficiently faster than internode communication, we only consider the internode 
communication. Each node at NIFS has 4 x 4 x 4 = 64 cubes, and therefore, 16 



Efficient Implementations of MD Simulations for LJ System 



25 



processes involve internode communication. The largest amount of communication 
is along the z-axis, which is about 375 KB xl6 = 6 MB. The bandwidth between 
each node at NIFS is about 5 GB/s. The total amount of internode communication 
is at most 36 MB, and therefore, the time spent on communication will be about 7.2 
ms for each step. The computational time without communication is about 0.3 s, 
which is 500 times longer than the estimated communication time. This implies that 
this code has coarse-grained parallelism and should be less affected by increasing the 
number of processes. The benchmark results, however, show a significant decrease 
in parallel efficiency for large-scale runs. 

We have also investigated the communication set up time at Kyushu University. 
The supercomputer at Kyushu University is HITACHI SR16000/J2 which is same 
as that at NIFS. But it is 42 nodes while NIFS has 128 nodes. We observed the 
communication set up time for MPI_Barrier and MPI_Allreduce over full of nodes, 
and found that the set up times are at most 50 us and 130 us, respectively. Therefore, 
communication cannot be a main factor of the parallel overhead. 



Nodes 


Total [s] 


Force [s] 


pair list [s] 


Communication [s] 


2 


304.609 


217.403 


38.26566 


12.54 


128 


404.335 


217.733 


41.85143 


14.39 



Table IV. Breakdowns of the computational time at NIFS. Total, Force, pair list, and Commu- 
nication denote the total time of computation [s], the time spent on force calculations [s], the 
time required to construct pair lists [s], and communication time [s], respectively. 



Nodes 


Total [s] 


Force [s] 


pair list [s] 


Communication [s] 


1 


180.945 


144.92 


14.388425 


4.12 


256 


228.614 


144.638 


16.299467 


7.92 



Table V. Same as Table HVl for ISSP. 



5.2. Lifetime of Pair Lists 

Since we perform weak scaling analysis, the size of the system increases as the 
number of nodes increases. This can decrease the lifetime of pair lists, since the max- 
imum speed of the particles increases in a larger system and faster particles reduce 
the lifetime of pair lists.'^Sl' The construction of a pair list takes much longer time 
than the force comutation takes, and therefore, the simulation run time is strongly 
affected if the number of pair lists being constructed increases. This increases the 
simulation run time even if the simulation is perfectly parallelized, and therefore, 
it is inappropriate to regard this cost as the parallelization cost. Therefore, we in- 
vestigate the system-size dependence on the time required for construction of pair 
lists. At NIFS, the pair list was reconstructed 20 times for the 2- node run while it 
was reconstructed 21 times for the 128-node run. The average life time is 51.2 steps 
for 2-node and 53.8 steps for 128-node caluclations. Therefore, the computational 
cost to construct pair lists increases for larger number of nodes, but the difference is 
small. The breakdowns of the computational time for the smallest and the largest 
runs are listed in Table ITVl and IVl and are also shown in Fig. [TJl The additional cost 
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Time [SEC] 




(a) NIFS 

2 nodes 



1 28 nodes 



(b) ISSP 

1 node 



256 nodes 



Fig. 15. Breakdowns of the computational time for (a) NIFS and (b) ISSP. The smallest and largest 
runs are shown for both cases. Although the communication time increases as the number of 
nodes increases, the increase in time for communication is at most 8% of the total increase, 
which cannot explain the observed parallel overhead. The idle time due to the synchronization 
is counted as "Others". 



of the pair list construction at NIFS is only 4 [s] which cannot explain the parallel 
overhead, which is over 100 [s]. Therefore, the lifetime of pair lists is not the main 
factor causing the reduced parallel performance. 

5.3. Communication Management 

Each simulation involves two types of communication at each step. One is point- 
to-point communication for sending and receiving data of particles, the other is 
collective communication for the synchronization of pair lists. The point-to-point 
communication is implemented in a simple way, that is, all processes send data in 
one direction and receive data from the opposite direction. All processes send and re- 
ceive data simultaneously in the same direction using MPLSendrecv (see Fig. 1161(a)). 
While the simple implementation of the communication may cause deadlock for the 
system with THE periodic boundary condition, the function MPLSendrecv avoid the 
deadlock implicitly by reordering of communication, which may cause some overhead 
in parallel efficiency for a large number of processes. Therefore, we examine different 
ways of communication which avoid the deadlock explicitly in order to investigate 
whether the communication management by MPLSendrecv is the factor of the par- 
allel overhead. Here we performed three methods. The first method, referred to as 
'Oneway', is the simple implementation described in Sec. 14.31 The second method, re- 
ferred to as 'Two-step Oneway', involves communications in 2 stages (see Fig. 1161(b)). 
All processes are classified into two groups, even and odd. In the first of the two 
stages, even processes send data and odd processes receive data. In the second stage. 
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(a) Oneway (b) Two-step Oneway (c) Painwise 




Fig. 16. Communication methods, (a) Oneway: All processes send information in the same direc- 
tion, and receive the data simultaneously using MPl_Sendrecv. (b) Two-step Oneway: Although 
all processes send information in the same direction, the sending and receiving processes are 
explicitly separated using MPI_Send and MPI_Recv. (c) Pairwise: Each process is classified as 
even or odd. When the even processes exchange the positions of particles with their left neigh- 
bors, the odd processes exchange the positions of particles with their right neighbors. Then the 
directions of communication are reversed. Communication is implemented by MPI_Sendrecv. 



data are transmitted in the reverse direction. The communications are implemented 
by the blocking functions MPLSend and MPLRecv. The third method, referred to 
as 'Pairwise', arranges the order of communication explicitly. All processes are clas- 
sified into two groups, even and odd, similarly to in the previous method. When the 
even processes exchange data with their right neighbors, the other processes exchange 
data with their left neighbors, and vice versa (see Fig. [16] (c)). The communications 
are implemented by MPI_Sendrecv. Note that, the behavior of 'Pairwise' at ISSP is 
not monotonic, i.e., the data of 128-node run is much faster than those of 8-, 16-, 
32-, and 64-node runs. This may be due to the network topology at ISSP which is 
hypercube, however, the details should be investigated later. 

A comparison of benchmark test for above the three methods is shown in Fig. [T71 
Figure [TTl (a) shows the results at NIFS. The results of Pairwise only slightly differ 
from those of Oneway. While the results of Two-step Oneway show similar per- 
formance up to 32 nodes, those for larger nodes are significantly slower than the 
other methods. Figure [T71 (b) shows the results at ISSP. There are major differences 
between the performances of the three methods even though the amounts of com- 
munication are identical, and Two-step Oneway considerably slower than the other 
methods. The reason why Two-step Oneway is the slowest method in both cases 
may be that double threads are executed by separating the sending and receiving 
processes instead of using MPLSendrecv. While it is unclear why the results of the 
three methods are almost the same for the largest number of nodes at ISSP, we can 
conclude that the overhead of communications strongly depends on the hardware 
environment. Considering the above results at NIFS and ISSP, the simple Oneway 
scheme is found to be less affected by the architecture of parallel machine and con- 
sistently has relatively good performance. 

5.4. Synchronicity Problem 

Parallel programs adopting the single program multiple data (SPMD) model 
usually require synchronicity, and the parallel MD described here is one such exam- 
ple. The slowest process determines the overall performance, since other processes 
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Fig. 17. Comparison between three communication schemes at (a) NIFS and (b) ISSP. The Oneway 
scheme exhibits the best performance at both facihties. 



must wait for the slowest process owing to the synchronization. Therefore, the par- 
ahel efficiency is strongly affected by the fluctuation of the computation time of each 
process. One of the factors causing the fluctuation is load imbalance. Load imbal- 
ance refers to the inhomogeneity of workload among processes. Unequally assigned 
tasks cause the fluctuation of the computational time of each step and they decrease 
the parallel efficiency. The workload of parallel MD mainly depends on the num- 
ber of interacting pairs in each process. To clarify this problem quantitatively, the 
number of interacting pairs is counted at the flnal state of the 128-node run. The 
parallel efficiency is determined by the slowest process which should be with the 
largest number of interacting pairs. 

In our calculation, we found that the largest and the smallest number of pairs 
were 11,656,318 and 11,629,010, respectively. The difference is about 0.02%, which 
cannot be responsible for the reduction in parallel efficiency. Note that this small 
fluctuation originates from the fact that the initial condition is homogeneous. The 
load imbalance can be serious for inhomogeneous systems such as systems involving 
phase transitions. However, our benchmark results show that the parallel efficiency 
decreases as the number of processes increases, even for a homogeneous system with 
a uniform workload for each process. Therefore, we assumed that there are other 
factors that affect the parallel efficiency. 

Next, we investigate the fluctuation of the computational time of force for each 
process. Suppose is the computational time spent for force calculation of p-th 
process at z-th step. The average of the force calculation at i-th step is defined as 

P p 

(i) 

with the number of processes Np. The time evolutions of tave at NIFS and ISSP are 
shown in Fig. [T51 The data of the smallest and the largest runs are presented. Note 
that, we observe the time spent only for force calculation which dose not involve 
communication. We found that the computational time for force show almost same 
values and independent of the number of nodes. However, the compuational time 
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for each step itself might fluctuate, which cause the one of candidate of reduction of 
parallel efficiency. Therefore, we focus on the calculation time of force for each step. 
To investigate the fluctuation of the computational time for each step, we define the 
relative difference as 

t^^ -t^'^ 

fW _ '■max ''min /c ^A 

diff (i) ' V 

t-ave 

where tmax = ™3'Xp{t|^''^} is the maximum time, and t^^^ = minp{t|^''^} is the min- 
imum time spent on the force calculation among all processes at ith step^ ^ff 
denotes the difference between the times spent on the slowest and fastest processes 
normalized by the average time at ith step. The time evolutions of the average and 
relative difference for force calculations at NIFS are shown in Fig. [19] (a). While the 
average times for force calculations are identical for 2- and 128-node runs, the rela- 
tive differences of the large run are significantly larger than those of the small run. 
The relative difference of the large run sometimes fluctuates by 100% (t^jg becomes 
1.0), which implies that the computational time is doubled at the step. The same 
plots for ISSP are shown in Fig. [TH] (b). Similar phenomena are observed, although 
the relative differences of the large run are at most 40%. The relative differences of 
the small run at ISSP are almost zero, since the run has only four processes, which 
is the reason why the fluctuation is relatively small. 

From the above, we can conclude that the main factor degrading the parallel 
efficiency is the fluctuation of the computational time at each step. Although the 
average time varies only slightly, the fluctuation of the computational time between 
processes increases as the number of processes increases. Then the delays of the 
slowest process accumulate as the overhead. There are several possible factors caus- 
ing this fluctuation. One possible source is noise from an operating system (OS). An 
OS has several background processes called daemons, which interfere with the user 
execution time and decreases its efficiency This noise from OS is referred to as OS 
jitter or OS detour. Interruptions by an operating system occur randomly, and they 
cause the fluctuation of the execution time of each process. Not only interruptions 
themselves, but also side effects caused by the interruptions, such as pollution of the 
CPU cache, are also a source of noise. 



§6. Summary and Further Issues 



We have developed an MD code that is not only suitable for massively parallel 
processors but also exhibits good single-processor performance for IBM Power and 
Intel Xeon architectures. Most of all algorithms presented in this paper can be 
applied to other particle systems with short-range interactions including many-body 
interactions. There are three levels of optimization. At the highest level, we attempt 
to reduce the computational cost by developing or choosing appropriate algorithms 
such as the bookkeeping method. Next, we have to manage memory access. All data 
required by cores should be in the cache so that the cores make the best use of its 
computational power. Finally, we have to perform architecture-dependent tuning. 
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Fig. 18. Computational CPU time for force calculations averaged over all processes at each step, 
(a) The data obtained at NIFS. The data obtained from the smallest and largest runs are 
denoted by '2' and '128', respectively, (b) The data obtained at ISSP. The data obtained from 
the smallest and largest runs are denoted by '1' and '256', respectively. Time evolutions are 
almost independent of the number of nodes. 
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Fig. 19. Time evolutions of the relative difference at (a) NIFS and (b) ISSP. Only the final 100 
steps are shown for the visibility. The fluctuations becomes larger as the number of nodes 



mcreases. 



since the development paradigms are considerably different between different CPUs. 

MD is suitable for MPP-type simulations since the computation-communication 
ratio is generally small. However, we have observed a significant degradation in 
parallel efficiency despite the use of conditions under which the communication cost 
should be negligible. We have found that the fluctuation of the execution time 
of processes reduces the parallel efficiency. The fluctuations become serious when 
the number of process becomes over one thousand. The OS jitter, which is the 
interference by the operating system, can be a source of the fluctuation. There 
are many attempts to handle the OS jitter, but it is difficult to be settled by user 
programs. One possibility is to adopt hybrid parallelization, i.e., use OpenMP for 
inner-node communications and use MPI for internodc communications. Hybrid 
parallelization not only reduces the number of total MPI processes, but also averages 
out the fluctuations which can improve the parallel efficiency. Note that the affect 
due to the OS jitter is generally order of nanoseconds or shorter, and the time 
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required to perform one step of MD simulations is order of millisecond. Therefore, 
there may be other sources of noise, such as translation lookaside buffer (TLB) miss. 
TLB is a kind of cache, which translate the address from virtual to reahl^D While it 
makes computation faster, it takes long time to fetch the data if the required address 
is not cached in the buffer. The translation tables are somtimes swapped out into 
other storage such as hard-disk, then the penalty of TLB miss can be expensive. 
Investigation of the source of the fluctuation in execution time is one of the further 
issues. 

In this paper, the load imbalance problem is not considered. Load balancing 
is important for parallelization with a huge number of processes since the fluctua- 
tion of the timing of each process greatly degrades the parallel efficiency as shown 
in Sec. 15.41 There are two approaches to load balancing, one is based on force de- 
composition and the other is based on domain decomposition. The load balancing 
in NAMD (Not Another Molecular Dynamics program is essentially achieved by 
the force decomposition, while NAMD adopts a hybrid strategy of parallelization. 
The processes of NAMD are under the control of a dynamic load balancer. The 
simulation box is divided into subdomains (called patches), and the total compu- 
tational task is divided into subtasks (called compute objects). The load balancer 
measures the load on each process and reassigns compute objects to processors if 
necessary. GROMACS (Groningen Machine for Chemical Simulations )P adopts a 
domain-decomposition-based strategy for load balancing, that is, it changes the vol- 
ume of domains assigned to processors to improve the load balance. The simulation 
box is divided into staggered grids whose volume are different. A load balancer 
changes the volume of each grid by moving the boundaries of cells to reduce load 
imbalance. There is also a method of non-box-type space decomposition that uti- 
lizes the Voronoi construction. Reference points are placed in a simulation box and 
each Voronoi cell is assigned to a process. Load balancing is performed by changing 
the positions of the reference points. Generally speaking, force-decomposition- 
based load balancing is a better choice if a system contains long-range interactions 
such as electrostatic force, and domain-decomposition-based load balancing is better 
for a system with short-range interactions. However, the choice strongly depends on 
the phenomenon to be simulated, and therefore, both strategies or some hybrid of 
them should be considered. 

The source codes used in this paper has been published online.ISS We hope 
that the present paper and our source codes will help researchers to develop their 
own parallel MD codes that can utilize the computational power of petaflop, and 
eventually exaflop machines. 
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Efficient implementations of the classical molecular dynamics (MD) method for Lennard- 
Jones particle systems are considered. Not only general algorithms but also techniques 
that are efficient for some specific CPU architectures are also explained. A simple spatial- 
decomposition-based strategy is adopted for parallelization. By utilizing the developed code, 
benchmark simulations are performed on a HITACHI SR16000/J2 system consisting of IBM 
POWER6 processors which are 4.7 GHz at the National Institute for Fusion Science (NIPS) 
and an SGI Altix ICE 8400EX system consisting of Intel Xeon processors which are 2.93 GHz 
at the Institute for Solid State Physics (ISSP), the University of Tokyo. The parallelization 
efficiency of the largest run, consisting of 4.1 billion particles with 8192 MPI processes, is 
about 73% relative to that of the smallest run with 128 MPI processes at NIFS, and it is 
about 66% relative to that of the smallest run with 4 MPI processes at ISSP. The factors 
causing the parallel overhead are investigated. It is found that fluctuations of the execution 
time of each process degrade the parallel efficiency. These fluctuations may be due to the 
interference of the operating system, which is known as OS Jitter. 

§1. Introduction 

The classical molecular dynamics (MD) simulation was first performed by Alder 
and Wainwright.'D' The original MD was used to investigate the hard-particle sys- 
tem, and an event-driven algorithm was adopted for the time evolution, which was 
soon followed by a time-step-driven algorithm.!^ Owing to the recent increase in 
computational power, MD is now a powerful tool for studying not only the equilib- 
rium properties but also the nonequilibrium transportation phenomena of molecular 
system^® 'IS} 'EJ as well as biomolecular systems .'^''^ Electronic degrees of freedom 
can also be considered by ah initio methods on the basis of quantum mechanics.'^ 
The increase in computational power also allows us not only to use more realistic 
and complex interactions but also to treat a larger number of particles. The recent 
increase in computational power has mainly been achieved by increasing the number 
of processing cores. This paradigm is called massively parallel processing (MPP), 
and most recent high-end computers have been built on the basis of this paradigm.^I^ 
The number of cores is typically from several thousand to several hundred thousand. 
In an MPP system, a single task is performed by a huge number of microproces- 
sors, which operate simultaneously and communicate with each other as needed. 
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Therefore, parallelization is now unavoidable in order to utilize the computational 
power of such machines effectively. While parallelization itself is of course impor- 
tant, the tunings for a single core have become more important for huge-scale and 
long-time simulations, since the overall performance of MD mainly depends on the 
performance on the single cores. In particular, optimizing memory access is impor- 
tant. The latency and bandwidth of memory access are sometimes slow compared 
with the execution speed of processors, and the data supply often cannot keep up 
with the requests by the processors. Therefore, data must be suitably arranged so 
that necessary data is located near a processor. Tuning for specific architectures is 
also important to achieve high performance. Since design concepts differ from one 
processor to another, it is necessary to prepare codes for each architecture, at least 
for the hot spot of the simulation where most time is spent during the execution, 
which is usually force calculation in MD. 

The Next-Generation Supercomputer project is currently being carried out by 
RIKEN.H'' The system is designed on the basis of the MPP paradigm and will use 
scalar CPUs with 128 GFlops, will consist of over 80 000 nodes, and is expected to 
achieve a total computational power of 10 petaflops. This supercomputer is now un- 
der development and will be ready in 2012. Therefore, we must prepare parallelized 
codes that can utilize the full performance of this system by then. The parallelization 
of MD has been discussed over several decades with the aims of treating larger sys- 
tems and performing simulations for longer timescales,^--- and a huge MD simulation 
involving one trillion particles has been performed on BlueGene/L, which consists 
of 212 992 processors Despite this success, it will not easy to obtain a satisfac- 
tory performance on the Next-Generation Supercomputer since the properties of the 
systems are considerably different from those of BlueGene. BlueGene has relatively 
slow processors with a 700 MHz clock speed for BlueGene/L, and consequently, it 
achieves a good balance between the processor speed and the memory bandwidth. 
In comparison, the Next-Generation Supercomputer will have a relatively high peak 
performance of 128 GFlops per CPU (16 GFlops for each core and each CPU consists 
of eight cores). Therefore, the memory bandwidth and latency may be a significant 
bottleneck. 

The purpose of the present article is to describe efficient algorithms and their 
implementation for the MD method, particularly focusing on the memory efficiency 
and the parallel overheads. We also provide some tuning techniques for a couple of 
the specific architectures. While the Lennard-Jones potential is considered through- 
out the manuscript, the techniques described here are generally applicable to other 
simulations with more complicated potentials. This manuscript is organized as fol- 
lows. In Sec. [21 algorithms for finding interacting particle pairs are described. Some 
further optimization techniques whose efficiency depends on architecture of the com- 
puter are explained in Sec. El Parallelization schemes and the results of benchmark 
simulations are described in Sec. HI The factors causing the parallel overhead for 
the developed MD code are discussed in Sec. \5\ and a summary and a discussion of 
further issues are given in Sec. El 
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Fig. 1. Truncation of the Lennard- Jones potential function, (a) Original form. The interaction 
range extends to infinity, (b) Truncation by interpolation. An interpolating function is intro- 
duced for the region from the interpolation point ri to the truncation point rc- (c) Truncation 
by adding extra terms. The additional function f{r) is chosen so that both the value and the 
derivatives of V(r) + /(r) are zero at the truncation point. 



for the particle distance r, well depth e, and atomic diameter a. In the following, 
we will use the physical quantities reduced by a, e, and k^, for example., the length 
scale is measured in the unit of a, and so forth. The first term of the right-hand side 
of Eq. (|2Tp describes the repulsive force, and the other term describes the attrac- 
tive force. This potential form has been widely used for a standard particle model 
involving phase transitions since this system exhibits three phases: solid, liquid, and 
gas. While the original form of the Lennard-Jones potential spans infinite range, it 
is wasteful to consider an interaction between two particles at a long distance since 
the potential decays rapidly as the distance increases. To reduce computation time, 
truncated potentials are usually used instead of the original potential.'^J There are 
several ways of introducing truncation. One way is to use cubic spline interpolation 
for the region from some distance to the cutoff lengthP^'fiSli as shown in Fig. [1] (b). 
In this scheme, there are two special distances, the interpolating distance n and 
the truncation distance Vc- The potential is expressed by a piecewise-defined func- 
tion that switches from V{r) given in Eq. (j2-ip to an interpolating function Vi{r) at 
r = Tj. The interpolation function is chosen so that values and the derivatives of the 
potential are continuous at the interpolation and the truncation points. These con- 
ditions require four adjustable parameters, and therefore, a third-order polynomial 
is usually chosen as the interpolating function. While this interpolation scheme was 
formerly popular, it is now outdated since it involves a conditional branch, which is 
sometimes expensive for current CPU architectures. 



§2. Pair List Construction 



2.1. Truncation of Potential Function 

The common expression for the Lennard-Jones potential is 
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Nowadays, truncation is usually introduced by adding some extra terms to the 
potential. At the truncation point, both the potential and the force should become 
continuous, and therefore, at least two additional terms are necessary. One such 



with two additional coefficients C2 and cq, which are determined so that V{rc) = 
^'(fc) = for the cutoff length rc- Here, we use a quadratic term instead of a linear 
term such as cir + cq, since the latter involves a square-root operation in the force 
calculation, which is sometimes expensive. While the computational cost of the force 
computation is more expensive than that for the interpolating method, this method 
is usually faster than the interpolating method since conditional branches can be 
hazards in the pipeline that obstruct the instruction stream. If accuracy is less 
crucial, one can truncate the potential by adding one constant, i.e., C2 = and cq 
is determined by V{rc) = 0. In this case, the calculation of force is identical to that 
using the original potential and the modification only appears in the calculation of the 
potential energy. Note that this scheme sometimes involves some problems regarding 
the conservation of energy since the force is not continuous at the truncation point. 
The truncation changes the phase diagram. One should check the phase diagram and 
compute phase boundaries for each truncated potential before performing production 
runs. In particular, the gas-liquid coexistence phase becomes drastically narrower as 
the cutoff length decreases. To investigate phenomena involving the gas-liquid phase 
transition, the cutoff length should be made longer than 3.0a. 

2.2. Grid Search Method 

To perform the time evolution of a system, we first have to find particle pairs 
such that the distance between the particles is less than the cutoff length. Since the 
trivial computation leads to 0{N'^) computation, where is the number of particles, 
we adopt a grid algorithm, which is also called the linked-list method,'^''^ to reduce 
the complexity of the computation to 0{N). The algorithm finds interacting pairs 
with in a grid by the following three steps: (i) divide a system into small cells, 
(ii) register the indices of particles on cells to which they belong, and (iii) search 
for interacting particle pairs using the registered information. There are two types 
of grid, one is exclusive and the other is inexclusive (see Fig. [2]). An exclusive grid 
allows only one particle to occupy a cell,'^ and the inexclusive grid allows more than 
one particle to occupy one cell simultaneously.'2I'''i3''iSli''i3 Generally, an exclusive 
grid is efficient for a system with short-range interactions such as hard particles, and 
an inexclusive grid is suitable for a system with relatively long interactions such as 
Lennard-Jones particles. However, the efficiency strongly depends not only on the 
physical properties such as density, polydispersity, and interaction length but also on 
the hardware architecture such as the cache size and memory bandwidth. Therefore, 
it is difficult to determine which is better before implementation. We have found 
that an inexclusive grid is more efficient than an exclusive grid for the region in 
which a Lennard-Jones system involves a gas-liquid phase transition. Therefore, we 
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(a) (b) 




Fig. 2. Searching for interacting pairs in a grid, (a) Exclusive grid. The size of each cell is 
determined so that only one particle can be placed in each cell, (b) Inexclusive grid. The 
interaction length determines the size of cells. More than one particle is allowed to occupy one 
cell simultaneously. 



adopt an inexclusive grid in the following. Note that the length of each cell should be 
larger than the search length rg, which is longer than the cutoff length as described 
later, so that there are no interactions occurring through three or more cells, in other 
words, interacting particle pairs are always in the same cell or adjacent cells. 

A simple way to express the grid information is to use multidimensional arrays. 
Two types of array are necessary, one is for the number of particles in each cell, and 
the other is for the indices of the particles in each cell. Suppose the total number of 
cells is Ug = Ugx X Ugy X Ugz, theu the arrays can be expressed by 

integer: GridParticleNumber[ng2:] [n^y] [n^^] (2 3) 

integer : Gridlndex[nga:] [ugy] [n^^] [g 

maxj ) 

where gmax is the capacity of one cell. The number of particles in a cell at {x, y, z) 
is stored as GridParticleNumber[x][y][z] and the index of the «-th particle at the 
cell is sotred as Gridlndex[x][y][z][i]. Although this scheme is simple, the required 
memory can be dozens of times larger than the total number of particles in the 
system, and therefore, the indices of the particles are stored sparsely, which causes 
a decrease in cache efficiency. To improve the efficiency of memory usage, the in- 
dices of particles should be stored sequentially in an array. This method is proposed 
by Kadau et al. and implemented in SPaSM.I^^ An algorithm to construct grid 
information using an array whose size is the total number of particles is as fol- 
lows. 

1: Divide a system into small cells, and assign a serial number to each cell. 
2: Label particles with the serial numbers of the cells that they belong to. 
3: Count the number of particles in each cell. 
4: Sort the particles by their labels. 
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Fig. 3. How to construct GridList, which is a hnear array of size A'^ storing grid information. 
Suppose there are four cells and 14 particles in a system (A'^ = 14). The left figure shows the 
stored data in the array and the right figure shows the configuration of the particles in the 
system. The numbers with arrows denote pointers indicating where the indices of particles 
should be stored (GridPointer) . The particles have indices from 1 to 14 and the cells are labeled 
from 1 to 4. First, prepare an array of size A'^, and label particles with the serial number of the 
cells to which they belong, (i) Count the number of particles in each cell, store the number as 
GridParticleNumber, and set a pointer for each cell at the position where the index of the first 
particle of each grid should be stored when the particles are sorted by their labels, (ii) Place 
the index of a particle at the position where the pointer corresponding to its label points, and 
move the pointer to the right, (iii) Repeat this procedure for all particles. The status of the 
GridList just after the particle number 6 is located into the GridList. (iv) The completed array 
after above (i) - (iii) procedures. The particles are sorted in GridList in order of the grid labels. 



The details of the implementation are shown in Fig. [3l Suppose that i-cell de- 
notes the cell labeled with i and the grid information is stored in the linear array 
GridList. Then GridParticleNumber [i] denotes the number of particles in i-cell and 
GridPointer [i] denotes the position of which the first particle of i-cell should be stored 
in GridList. The complexity of this sorting process is 0{N) instead of O(A^logA^) 
for usual Quicksort since it does not perform a complete sorting. The memory usage 
is effective since the indices of particles in each grid are stored contiguously in the 
array. 

2.3. Bookkeeping Method 

After constructing GridList, we have to find interacting particle pairs using the 
grid information. Suppose the particle pairs whose distance is less than the cutoff 
length are stored in PairList. PairList is an array of pairs which hold two indicies 
of interacting particles. All particles are registered in cells and labeled by the serial 
numbers of the cells. There are two cases, interacting particles share or do not 
share the same label, which correspond to inner-cell interaction and cell-crossing 
interaction. Let GridList [s..e] denote a subarray of GridList whose range is from s 
to e with the data of the pair list stored in PairList. The pseudocodes used to find 
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Algorithm 1 Finding the interacting particle pairs 



for all z-cell do 
Si ^ GridPointer[i] 
Cj Si+ GridParticleNumber[i] -1 
Lc GridList[sj..ej] 

for all j such that j-cell is the neighbor of i-cell do 

Sj -It- GridPointer[j] 

ej Sj+ GridParticIeNumber[j] -1 

Append GridList[sj..ej] to Lc 
end for 

for k = 1 to GridParticleNumber[i] do 
for / = A; + 1 to \Lc\ do 
m ^ Lc[k] 
n ^ Lc[l] 

if m- and n-particles interact each other then 

Register a pair (m, n) to PairList 
end if 
end for 
end for 
end for 



interacting pairs are shown in Algorithm [TJ Only 13 of the 26 neighbors should be 
checked for each cell because of Newton's third law. 

As the number of particles increases, the cost of constructing the pair list in- 
creases drastically since it involves frequent access to memory. While it depends 
on the details of the system, constructing a pair list is sometimes dozens of times 
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more expensive than the force calculation. In order to reduce the cost of construct- 
ing a pair list, a well-known bookkeeping method was proposedPS The main idea 
of the bookkeeping method is to reuse the same pair list for several time steps by 
registering pairs within a search length rg which is set at longer than the cutoff (in- 
teraction) length Tc (see Fig. U]). The margin rg — Tc determines the lifetime of a 
pair list. While a longer margin gives a longer lifetime, the length of the pair list 
also increases, and consequently, the computational time for the force calculation 
becomes longer. Therefore, there is an optimal length for the margin that depends 
on the details of the system, such as its density and temperature. Several steps after 
the construction of the pair list, the list can become invalid, i.e., some particle pair 
that is not registered in the list may be at a distance shorter than the interaction 
length Tc. Therefore, we have to calculate the expiration time of the list and can 
check the validity of the pair list for each step as the following simple method. The 
expiration time tf. of a pair list constructed at time t is given by 

te = + t, (2-4) 

where Umax is the absolute value of the velocity of the fastest particle in the system. 
The factor 2 in the denominator corresponds to the condition that two particles 
undergo a head-on collision. It is necessary to update the expiration time when 
the maximum velocity changes to 

t,^{te-t)^ + t, (2-5) 

^new 

where and v^ew denote the previous and current maximum speeds, respectively. 
Obviously, the expiration time will be brought forward in the case of a faster particle, 
and vice versa. This technique for calculating the expiration time is called the 
dynamical upper time cutoff (DUTC) method and was proposed by Isobe.'^ 

Although the validity check by considering the maximum velocity is simple, the 
expiration time te determined by the velocities of the particles is shorter than the 
actual expiration time of the pair list, since the pair list is valid as long as the 
particles migrate short distances. Therefore, we propose a method which extends 
the lifetime of a pair list by utilizing the displacements of the particles. The strict 
condition for the expiration of a pair list is that there exists a particle pair {i,j) such 
that 

Iq"'*" - qf I > rs and |q, - q,.| < r,, (2-6) 

where q?''^ is the position of the i-particle when the pair list was constructed and q^ 
is the current position of the z-particle. The lifetime of a pair list can be extended 
if we use the strict condition for the validity check, but the computational cost of 
checking this the condition is 0{N'^). We therefore extend the lifetime of the pair 
list by using a less strict condition. The main idea is to prepare a list of particles 
that have moved by large amounts and check the distance between all pairs in the 
list. The algorithm is as follows: (i) Calculate the displacements of the particles from 
their position when the pair list was constructed and find the largest displacement 
^^'max- (ii) If there exists a particle such that Z\rj > rg — Tc — Z\rmax) then append 
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(i) (ii) (iii) 



Ts - Tc Ld 




Fig. 5. The large displacement check (LDC) method, (i) Calculate the displacements Ari for the 
particle i and find the largest displacement Zimax- (ii) Register a index of a particle i such that 
Ari > Ts — Tc — /i)"max to a list Lo whose size is denoted by |-Ld|. (iii) Old and current position of 
particles i and j. The dashed circle denote the old position which was registered in the previous 
consturction of the pair list. The solid circles denote the current positions of particles. The 
expiration of the pair list is determined by the old and the current distance of the particles. 



the particle to the hst of large displacement Ld which size is denoted by \Lj)\. If 
the list is longer than the predetermined maximum length of the list A'^i, that is, 
|Ld[ > A'^i ,then the pair list is invalidated, (iii) If the old distance is longer than the 
search length and the current distance is shoter than the cutoff length, it means that 
there are interacting particles which are not registered in a pair list. Therefore, the 
pair list is invalidated if there exists a pair (i, j) in the list of large displacement such 
that |q°^'^ — q°'^| > and |qj — q^l < re- We call this method the large displacement 
check (LDC) method. See Fig. O for a schematic description of the LDC method. 

Although the lifetime of the pair list increases with Ni, the computational cost 
also increases as 0{Nf). Therefore, there is an optimal value for Ni. Since the 
LDC method is more expensive than DUTC method, we adopt a hybrid algorithm 
involving both methods, that is, first we use the DUTC method for the validity check, 
and then we use LDC method after the DUTC method decides that the pairlist is 
expired. The full procedure of the validity check is shown in Algorithm [2l 

2.4. Sorting of pair list 

After constructing a pair list, we can calculate the forces between interacting 
pairs of particles. Hereafter, we call the particle of index i the i-particle for conve- 
nience. Suppose the positions of particles are stored in q[A^]. A simple method for 
calculating the force using the pair list is shown in Algorithm [3l This simple algo- 
rithm is, however, usually inefficient since it involves random access to the memory. 
Additionally, it fetches and stores the data of both particles of each pair, which is 
wasteful. Therefore, we construct a sorted list to improve the efficiency of memory 
usage by calculating particles together that interact with the same particles. For 
convenience, we call the lower-numbered particle in a pair the key particle and the 
other the partner particle. The partner particles with the same key particle are 
grouped together. The detail of the implementation is shown in Fig. [6l Suppose the 
data of the sorted list and the position of the first partner particle of the i-particle in 
the list are stored in SortedList and KeyPointer[i], respectively. Then the algorithm 
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Algorithm 2 Checking the vahdity of a pair hst 

1. When a pair list is constructed 

1: for all i-particle do 

2: q^^'^ ^ Qj {Keep the positions} 

3: end for 

4: buffer Jength Ts — rc 

2. Validity check by DUTC method 

1: '-'max ^ thc maxiniTim velocity of the particle 

2: bufferJength ^ bufferJength— 2i;inax^i 

3: if bufferJength > then 

4: the pair list is valid. 

5: else 

6: Check the validity by considering displacements. 

7: end if 

3. Validity check by LDC method 

1: for all i-particle do 

2: z\ri ^ |qf - qj 

3: end for 

4: /Armax ^ max{Z\ri} 

5: Prepare a list Ld 

6: for all i-particle do 

7: if Avi > Vg-Vc- Avraax then 

8: Append i-particle to Ld 

9: if |Ld| > Ni then 

10: the pair list is invalidated. 

11: end if 

12: end if 

13: end for 

14: for all (i,j) in Ld do 

15: if |q°^^ — q^^*^! > Vg and |qj — qj| < Vc then 

16: the pair list is invalidated. 

17: end if 

18: end for 

19: the pair list is invalidated. 



Algorithm 3 Calculating the force in a simple manner 
1: for all pairs in PairList do 
2: if |q[i] - qL?]! < Tc then 
3: Calculate force between i- and j-particles. 
4: end if 
5: end for 
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Fig. 6. A sorted list. A list sorted by the indices of key particles is constructed by the following 
procedure, (i) Count the number of partner particles for each key particle for the particle pairs 
are stored in the pair list, (ii) Prepare a linear array of size A'^pair, and set a pointer for each 
key particle where the index of the first partner particle will be stored, (iii) Similarly to in the 
grid construction (Fig. [S]), place the index of a particle at the pointer of its key particle and 
move the position of the pointer to the next position, (iv) Repeat this procedure for all pairs to 
construct the sorted list. The figure shows two lists, KeyPointer and SortedList. KeyPointer[i] 
stores the first position of the interacting particles in SortedList. 

used to calculate the force using the sorted pair list is shown in Algorithm |H The 



Algorithm 4 Calculating the force using the sorted pair list 
1: for i = 1 to TV - 1 do 

2: Qkey ^ q[«] 

3: fkey ^ 

4: for k = KeyPointer[i] to KeyPointer[z + 1] -1 do 

5: j ^ SortedList [/c] 

6: ^ Iqicey -qbil 

7: / ^ -V'{r) 

8: Update momenta of the j-particle using /. 

9: Update fkey using /. 

10: end for 

11: Update momenta of i-particle using fkey 

12: end for 



variables q^-gy and fkey are intended to be stored in CPU registers. The amount 
of memory access decreases compared with that for Algorithm [3l since the fetching 
and storing are performed only for the data of partner particles in the inner loop. 
The amount of memory access becomes half when the key particles has an enough 
number of partner particles. The force acting on the key particle is accumulated 
during the inner loop, and the momentum of the key particle is updated at the end 
of the inner loop. Note that the loop variable i takes values from 1 to — 1, instead 
of 1 to A^, since this is the loop for the key particle, and the A^-particle cannot be a 
key particle. The value of KeyPointer [A^] should be |PairList| + 1, where |PairList| 
is the number of pairs stored in PairList. 

§3. Further Optimization 

3.1. Elimination of Conditional Branch 

Since we adopt the bookkeeping method, a pair list contains particle pairs whose 
distances are longer than the interaction length. Therefore, it is necessary to check 
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whether or not the distance between each pair is less than the interaction length 
before the force calculation (Algorithm [3]) . These conditional branches (so-called, 
"if then else" of "case" in the high-level language statement) can sometimes be ex- 
pensive for RISC (Reduced Instruction Set Computer)-based processors such as IBM 
P0WER6, and it causes a significant decrease in execution speed. To prevent the de- 
crease in speed due to conditional branches, we calculate the forces without checking 
the distance and assign zero to the calculated value for the particle pairs whose dis- 
tances are longer than the interaction length. This method is shown in Algorithm [5j 
While it appears to be wasteful, it can be faster than the original algorithm when 

Algorithm 5 Calculating the force with if-branch elimination 

1: for k = Ito Afpair do 

2: PairList[/c] 

3: r- ^ |q[i] - q[j]| 

4: / ^ -V'{r) 

5: if r > rc then 

6: /^O 

7: end if 

8: Update momenta of i- and j-particles using /. 

9: end for 



the penalty due to the conditional branches is more expensive than the additional 
cost of calculating forces. Additionally, this loop can be performed for all pairs, 
which makes it easy for a compiler to optimize codes, such as that for prefetching 
data. Note that, some CPU architectures prepare the instruction for the conditional 
substitution. For example, PowerPC architecture has fsel (Floating-Point Select) 
instruction which syntax is fsel FRT, FRA, FTC, FRB where FRT, FRA, FTC, and 
FRB are registers. If FRA > then FRT = FRB, otherwise FRT = FRC. We found that 
this optimization provides us with more than double the execution speed on IBM 
P0WER6 in the benchmark simulation whose details are described in Sec. 14.51 

3.2. Reduction of Divisions 

The explicit expression of Algorithm [3] is shown in Algorithm [6l where C2 is the 
coefficient for the truncation defined in Eq. (j2-2p and dt is the time step. For simplic- 
ity, the diamiter of the particle a is set to unity and only the force calculations are 
shown. As shown in Algorithm[6l the force calculation of the Lennard- Jones potential 
involves at least one division operation for each pair. Divisions are sometimes ex- 
pensive compared with multiplications. Therefore, computations can be made faster 
by reducing the number of divisions. Suppose there are two independent divisions 
as follows; 

A2 ^ I/B2. 

We can reduce the number of divisions by transforming them into 
C ^ l/{Bi X B2), 
Ai^Cx B2, 
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Algorithm 6 Explicit expression of force calculation 
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end for 



A2^C X Bi. 

Here, the number of divisions decreases from two to one, while three multiplications 
appear. This optimization can be effective for architecture in which the penalty for 
divisions is large. We apply this optimization technique to the force calculation of 
the Lennard- Jones system by utilizing loop unrolling, that is, we first obtain two 
independent divisions by unrolling and then we apply this method of reducing the 
number of divisions to the unwound loop. Although one can apply this optimization 
technique to Algorithm [6l it is more efficient to use it together with the sorted array 
described in Sec. 12.41 The algorithm for the force calculation with a reduced number 
of divisions is shown in Algorithm [71 The term [xj denotes the largest integer less 
than or equal to x and subscriptions a and b correspond to the loop unrolled twice. 
This optimization achieves 10% speedup in the condition described in Sec. 14.51 

3.3. Cache Efficiency 

As particles move, the indices of the partner particles of each key-particle become 
random. Then the data of the interacting particles, which are spatially close, may 
be widely separated in memory. This severely decreases the computational efficiency 
since the data of particles that are not stored in the cache are frequently required. 
To improve the situation, reconstruction of the order of the particle indices in arrays 
is proposed.'^ This involves sorting so that the indices of interacting particles are 
as close together as possible. This can be implemented by using the grid informa- 
tion described in Sec. 12.21 After an array of grid information is constructed, the 
particles are arranged in the order in the array. The sorting algorithm is shown in 
Fig. [71 After sorting, the indices of the particles in the same cell become sequential, 
which improves the cache efficiency. Note that the pair list is made invalid by this 
procedure; therefore, the pair list construction and the sorting should be performed 
at the same time. 

A comparison between the computational speed with and without sorting is 
shown in Fig. [SI The simulations were performed on a Xeon 2.93 GHz machine 
with 256 KB for an L2 cache and 8 MB for an L3 cache. We performed the sort- 
ing every ten constructions of a pair list, which is typically every few hundred time 
steps, since the computational cost of sorting is not expensive but not negligible. To 
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Algorithm 7 Calculating the force using the sorted array and reduction of divisions 

1: for i = 1 to iV - 1 do 

2: Qkey ^ 

3: fkey ^ 

4: n i— Key Pointer [i + 1] - Key Pointer [i] 

5: for A; = 1 to [n/2\ do 

6: ka ^ KeyPointcr[i] + 2{k - 1) 

7: ki, ^ KcyPointcr[i] + 2{k - I) + 1 

8: ja ^ SortedList[A;a] 

9: jb SortedList [/cfe] 

10: r„ ^ q[j 

11: Vb ^ q[jb] - Qkey 

12: rl ^ |r„|2 

13: rl |r(,|2 

14: ^ r2 X X 

15: X X 

16: ^r^xr^xrl 

17: r^^ X X 

18: D ^ l/(ri4 X ri4) 

19: fdta ^ [(24r^ - 48) X D X + 8C2J x 

20: fdtb ^ [i24rl -48)xDx r^^ + 802] x dt 

21: fkey ^ fkey + fdta X r^ + /di;, X r^ 

22: p[j„] ^ p[j„] - fdta X r^ 

23: pL^b] ^ Pbb] - /citb X Yb 

24: end for 

25: if n is odd then 

26: calculate force for the last partner particle. 

27: end if 

28: p[i] ^ p[i] + fkey 

29: end for 



mimic a configuration after a very long time evolution without sorting, the indices 
of the particles were completely shufHed at the beginning of the simulation, that is, 
the labels of the particles were exchanged randomly while keeping their positions 
and momenta. The computational speed is measured in the unit of MUPS (millions 
update per second) , which is unity when one million particles are updated in one sec- 
ond. As the number of particles increased, the computational efficiency was greatly 
improved by sorting. The stepwise behavior of the data without sorting reflected 
the cache hierarchy. Since the system is three dimensional, amount of memory re- 
quired to store information of one particle is 48 bytes. Therefore, L2 cache (256) 
KB corresponds to 256 KB/48 bytes ~ 5.3 x 10^ and L3 cache (8MB) corresponds 8 
MB/48 bytes 1.7 x 10^. When the number of particles was smaller than 5.3 x 10^, 
the difference in speed between the data with and without sorting was moderate. 
While the number of particles was larger than 5.3 x 10^, we found that the efficiency 
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(a) Before Sorting (b) After Sorting 




1 I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 |l0|l1 |l2|l3|l4| 



Fig. 7. Sorting the indices of the particles to increase cache efficiency. There are four ceUs and 14 
particles in the system, (a) Status of the system before sorting. The array of grid information 
corresponding to this configuration is shown below (see Fig. [3} . (b) Indices of the particles are 
sorted using the grid information. 



3.5 




10^ 10^ 10'* 10^ 10^ 

Particles 

Fig. 8. Computational speed with and without sorting. The speed is greatly improved by sorting 
for a large number of particles A'^. Dashed vertical lines denote A'^ — 5.3 x 10'' and ~ 1.7 x 10^ . 
The behavior changes at the lines reflecting the cache hierarchy. 



was improved by 30 to 40% by sorting. When the number of particles was larger 
than 1.7 X 10^, the computational speed without sorting became much lower than 
that with sorting. We found that the effect of the sorting in efficiency was about a 
factor of 3 in the case of 10^ particles. 

3.4. Software Pipelining 

The elimination of the conditional branch can be inefficient in some types of 
CPU architectures such as Intel or AMD, since the cost of the conditional branch is 
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Pipeline stages 
DIS ^ 



SQR 



CMP 



CFD 



UMP 



qp - Qk 

».|2 



2 ^ 

r < r, 



Pp 



Pp - dfr 



NOP 



(a) Usual Calculation 



Descriptions 

calculate distance between two particles 
calculate the square of the distance 
compare the distance and the cutoff 
calculate the force using the distance 
update the momentum of the particle 

no operation (distance > cutoff) 
(b) Software pipelined 



Loop Counter 


Loop Counter 


DIS1 




1 


DIS1 


SQR1 


CMP1 


CFD1 


UMP1 
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DIS2 


S0R1 


CMP1 


CFD1 


NOP 
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DIS2 


SQR2 


CMP2 


NOP 


NOP 
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DIS3 


S0R2 


CMP2 


NOP 


NOP 


3 


DIS3 


SQR3 


CMP3 


CFD3 


UMP3 


3 


DIS4 


SQR3 


CMP3 


CFD3 


UMP1 


4 


DIS4 


SQR4 


CMP4 


NOP 


NOP 


4 


DIS5 


S0R4 


CMP4 


NOP 


NOP 


5 


DISS 


SQR5 


CMP5 


CFD5 


UMP5 


5 


SQR5 


CMP5 


CFD5 


UMP3 








UMP5 



Fig. 9. Software pipelining. Calculation of the force consists of five parts: DIS, SQR, CMP, 
CFD, and UMP. (a) Implementation without software pipelining. All calculation are performed 
sequentially, (b) Implementation with software pipelining. The order of the instructions are 
arranged in order to improve the throughput of the instructions. 



not expensive. In such CPU architectures, software pipeUning is effective in improv- 
ing the computational speed. Recent CPUs can handle four or more floating-point 
operations simultaneously if they are independent. The force calculation of one pair, 
however, consists of a sequence of instructions that are dependent on each other. 
Therefore, CPU sometimes has to wait for the previous instruction to be completed, 
which decreases the computational speed. Software pipelining is a technique to re- 
duce such dependencies between instructions in order to increase the efficiency of 
the calculation. The purpose of the software pipening is increasing a number of 
instructions which can be executed simultaneously, and consequently, increasing the 
instruction throughputs in a hardware pipepine. While the software-pipelining tech- 
nique is often used in combination with loop unrolling, the simple loop unrolling can 
cause a shortage of registers leading to access to the memory which can be expensive. 
The main idea of software pipelining is to calculate independent parts of the force 
calculation simultaneously by shifting the order of the instructions. The calculation 
of force can be broken into the following five stages; 
DIS calculate distance between two particles. 
SQR calculate the square of the distance. 
CMP compare the distance and the cutoff length. 
CFD calculate the force using the distance. 
UMP update the momentum of the particle. 

The schematic image of the software pipelining is shown in Fig. [9l The number 
after a stage denotes the index of the pair (the value of the loop counter), e. g., 
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Algorithm 8 Calculating the force with software pipelining 

1: for i = 1 to - 1 do 

2: Qkey ^ 

3: Pkey ^ 

4: fdt ^ 

5: k KeyPointer[i] 

6: ja SortedList[fe] 

7: Ta ^ q[ja] - Qkey 

8: n^o 

9: jb ^ 

10: for k = KeyPointer[z] to KeyPointer[i + 1] -1 do 

11: r^ra 

12: ^ |rp 

13: i ^ ja 

14: ^ SortedList[k+l] 

15: Ta ^ q[ja] - Qkey 

16: if < then 

17: Pkey ^ Pkey + fdt X 

18: p[jb] ^ p[jb] - fdt X Tb 

19: /dt ^ [(24r'^ - 48) /r^'' + 8C2] x 

20: jb ^ j 

21: Tfc ^ r 

22: end if 

23: end for 

24: p[jb] ^ p\jb] - fdt X Tb 

25: p[i] ^ pW + Pkey + fdt X Tb 

26: end for 



DISl denotes the calculation of the distance of the first particle pair. Figure (a) 
denotes the calculation without software pipelining. Suppose the distances between 
third and fifth pairs are longer than the cutoff length. Then the calculation of the 
force and the update the momentum are unnecessary. We refer this to NOP (no 
operation). All calculations are performed sequentially. For instance, SQRl is per- 
formed after DISl is completed, CMPl is performed after SQRl is completed, and 
so on. Figure [9] (a) denotes the calculation with software pipelining. The order 
of the calculations are shifted so that the number of independent calculations are 
increased. For instance, DIS4, SQR3, and UMPl can be performed simultaneously 
since there are no dependencies between them (see the column where the value of the 
loop counter is three). Increase of independent instructions increases the efficiency 
of the hardware pipeline, and consequently, impoves the computational speed The 
pseudo code corresponding to Fig. [9] is shown in Algorithm [HI As far as we know in 
our experience, the software pipelining without loop unrolling described here is the 
best choice for recent architecture such as Intel Xeon processors. 
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Total System 
(Parallel Environment) 

Node 

Node 

Node 

Node 

Fig. 10. Parallel environment. A total system of a supercomputer consists of several nodes. A 
node consists of several CPUs. A CPU consists of several cores. One or two processes are 
executed on a core. 



§4. Parallelization 

4.1. Parallelization Scheme 

Since the size of the system or computational steps are hmited on the single 
process job, parallel computation is required in order to simulate large systems or 
long time steps. Parallel computation is executed on a parallel environment such as 
a supercomputer. A total system of a supercomputer has hierarchic structure, i.e., 
a total system consists of nodes, a node consists of CPUs, a CPU consists of cores 
(see Fig. [T0|) . One or two processes is usually executed on a single core, and a set of 
processes work concertedly on the same task. 

There has been considerable effort devoted to the parallelization of MD simula- 
tions. The parallel algorithms used for MD can be classified into three strategies, do- 
main decomposition, particle decomposition, and force decomposition.^ In domain 
decomposition (also called spatial decomposition) the simulation box is subdivided 
into small domains, and each domain is assigned to each process. In particle de- 
composition, the computational workload is divided and distributed on the basis of 
the particles. Different processes have different particles. In force decomposition the 
computational workload is divided and distributed on the basis of the force calcula- 
tion. Consider a matrix Fij that denotes the force between i- and j'-particles. Force 
decomposition is based on the block decomposition of this force matrix. Each block 
of the force matrix is assigned to each process. Generally, domain decomposition 
is suitable for systems with short-range interactions and simulations with a large 
number of particles ,1^'!^ and force decomposition is suitable for systems with long- 
range interactions such as electrostatic force and simulations with a large number of 
steps. In the present paper, we adopt a simple domain decomposition method for 
parallelization, i.e., we divide a system into small cubes of identical size. We use a 
Message Passing Interface (MPI) for all communication. To perform the computa- 
tion, three types of communication are necessary: (i) exchanging particles that stray 
from the originally associated process, (ii) exchanging information of particles near 
boundaries to consider the cross-border interactions, and (iii) synchronization of the 
validity check of pair lists. 



Node CPU Core 



CPU 




CPU 




CPU 




CPU 








Core 




Core 






Process J 






















CPU 




CPU 




CPU 




CPU 








Core 




Core 






Process ^ 





















Efficient Implementations of MD Simulations for LJ System 



19 



(a) x-direction (b) y-direction (c) z-direction (d) transfer of particles 




Fig. 11. Sending and receiving of positions of particles that are close to boundaries, (a) Sending 
along the a;-direction. First, each process sends/receives the positions of particles near the right 
boundary to/from the right neighbor. Similarly, the positions of particles near the left boundary 
is transmitted, (b) Sending along the y-direction. Positions of particles near the boundaries 
is sent to neighbors, (c) Sending along the z-direction is performed similarly, (d) Transfering 
particles. Suppose there are three neighboring domains associated with Process A, B, and C. 
First Process A sends a particle to Process B along x-axis. Process B sends the particle recieved 
from Process A to Process C. Then Process C recieves particles from Process A via Process 
B. Thus, each process obtains the information of particles from its diagonal neighbors without 
direct communication. 

4.2. Exchanging Particles 

Although all particles are initially in the domains to which they are placed, they 
tend to stray from their original domains as a simulation progresses. Therefore, 
each process should pass the migrated particles to an appropriate process at an 
appropriate time. There is no need for exchanging particles each step since it will 
not cause problems while migration distances are short. Therefore, migrated particles 
are exchanged simultaneously with a reconstruction of pair list. 

4.3. Cross-Border Interactions 

To calculate forces between particle pairs which are assigned to different pro- 
cesses, communication between neighboring processes is necessary. While a naive 
implementation involves communication in 26 directions since each process has 26 
neighbors, the number of communications can be reduced to six by forwarding in- 
formation of particles sent by other processes. Suppose a process is assigned the 
cuboidal domain Sx < x < Cx, Sy < y < Cy, and Sz < z < Cz, and the search length 
is denoted by rg, which is defined in Sec. 12. 3[ The procedures to send and receive 
positions of particles on the boundaries are as follows. 
1: Left: Send positions of particles in the region s^; < x < s^^ -|- rg to the left process, 

and receive positions of particles in the region Sx — rg < x < Sx from the left 

process. 

2: Right: Send the positions of particles in the region ex — rs < x < Cx, and receive 
positions of particles in the region Cx ^ x ^ Cx ~t~ from the right process. 

3: Backward: Send positions of particles in the region Sy < y < Sy + r^ to the 
backward process, including the particles received from the left and the right 
process. Receive positions of particles in the region Sy — rg < y < Sy from the 
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backward process. 

4: Forward: Send the positions of particles in the region Cy — < y < Cy to the 
process in front, including the particles received from left and the right processes. 
Receive positions of particles in the region Cy < y < Cy + Vs from the process in 
front. 

5: Down: Send the positions of particles in the region Sz < z < Sz + Vg to the lower 
process, including the particles received from other processes. Receive positions 
of particles in the region Sz — Ts < z < Sz from the lower process. 
6: Up: Send positions of particles in the region — rg < z < from the upper 
process, including the particles received from other processes. Receive positions 
of particles in the region < z < + rg from the upper process. 
After the above six communications, exchanging positions of particles close to bound- 
aries are completed including between diagonal processes. The details are illustrated 
in Fig. [TTJ Note that the simple implementation of the communication may cause 
deadlock which is a situation that two or more processes will wait responce forever. 
Suppose there are two processes, A and B. If A tries to send a message to B, and B 
also tries to send to A with MPI blocking communications, then the communication 
will not be completed. Similarly, suppose there are three processes. A, B, and C. If 
the communication path are A ^ B, B — )• C, and C — )■ A, then the system is getting 
into a deadlock. In the domain decomposition parallelism with periodic boundary 
condition, it is possible that a deadlock occurs. There are several techniques to 
avoid this problem, such as reversing the order of sending and receiving and using 
nonb locking operations. A simple way to avoid such a deadlock problem is to use 
the MPI_Sendrecv function. This function does not cause deadlock for closed-path 
communications. The pseudocode using MPLSendrecv is shown in Algorithm [9l 
SendDir and RecvDir denote the direction that the process should send to or re- 
cieve from. There are six directions to complete the communication. The number 
of particles to be sent in each direction is fixed until the expiration of a pair list. 
Suppose Ns is a number of particles to send. Then the number of data sent is SNg, 
since only coordinates are required to calculate forces. The message tag denoted by 
Tag, which is an arbitrary non-negative integer but it should be the same for sender 
and receiver. 

4.4. Synchronization of pair list Constructions 

Each process maintains its own pair list and has to reconstruct it when it expires. 
If each process checks the validity of the pair list and reconstructs it independently, 
since it causes idle time, and the calculation effiency dicreases. To solve this problem, 
we synchronize the validity check of pair lists for all processes, that is, we reconstruct 
all the pair lists even when one of the pair lists has expired. While this reduces 
the average lifetime of the pair lists, the overall speed of the simulation is greatly 
improved since the idle time is eliminated (see Fig. I12p. This synchronization process 
can be implemented with MPI_Allreduce, which is a global-reduction function of the 
MPI. Note that the global synchronization of pair lists can have a serious effect on the 
parallel efficiency when a system is highly inhomogeneous such as a system involving 
phase transitions or a heat-conducting system. In such cases, further advanced 
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Algorithm 9 Communications for cross border interactions 

1: for SendDir in [left, right, backward, forward, down, up] do 
2: RecvDir ■h- the opposite direction of SendDir 
3: SendBuf ^ coordinates of particles to send 
4: A^s number of particles to send 
5: Nr number of particles to receive 
6: Prepare RecvBuf 

7: DestRank ^ rank of neighbor on the SendDir. 
8: SrcRank rank of neighbor on the RecvDir. 

9: Call MPl_SendTecY{SendBuf, 3A^s, MPI_DOUBLE, DestRank, Tag, 

RecvBuf, 3Nr, MPLDOUBLE, SrcRank, Tag, MPLCOMM.WORLD) 
10: Update coordinates of particles using RecvBuf 

11: end for 




: ■ ' ■ ■ '■■'■■'■■'■\ Force Calculation 
KSSSSSSS Pair List Construction 
^^^^^^^^ Idle Time 



Fig. 12. Synchronization of pair lists. The numbers in the boxes of force calculation denote the 
time steps. The size of the boxes of force calculation and the pair list construction reflect the 
computational time required for them, i.e., the size of the box is proportional to the computa- 
tional time. Suppose the lifetime of the pair list of Process 1 is 3 steps, that is, the pair list 
should be reconstructed after three force calculations. Similarity, the lifetime of the pairlists are 
4 and 5 for Process 2 and 3, respectively, (a) Without synchronization, if the processes recon- 
struct pair lists independently, then other processes must wait for the reconstruction resulting 
in a huge amount of idle time, (b) With synchronization, if the processes synchronize their own 
pair lists with others, then the idle time vanishes while the average lifetime of pair lists also 
decreases. 



treatment that manages pair lists locally at each location of grid would be required. 

4.5. Results of Benchmark Simulations 

To measure the efficiency of the developed code, we perform benchmark simu- 
lations on two different parallel computers located at the different facilities, which 
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Facility 


Name 


CPU 


Cores Processes 


Memory 


NIFS 


HITACHI SR16000/J2 


IBM P0WER6 (4.7 GHz) 


32 64 (*) 


128GB 


ISSP 


SGI Altix ICE 8400EX 


Intel Xeon X5570 (2.93 GHz) 


8 4 or 8 (**) 


24GB 



Table I. Summary for HITACHI SR16000/J2 at the National Institute for Fusion Science (NIFS) 
and SGI Altix ICE 8400EX at the Institute for Solid State Physics (ISSP). Location, name, 
CPU, number of cores per node, number of processes per node, and memory per node are 
shown. (*) Using simultaneous multithreading (SMT), 64 processes are executed on each node. 
(**) The single-node run is performed with four processes on one CPU and two-or-more-node 
runs are performed with 8 processes per node on two CPUs because of the queue design at ISSP. 



are NIFS and ISSP. The details of the computers are listed in Table HI In the actual 
analysis we used the so-called week scaling analysis, which is the dependence of the 
execution time in terms of node number under the condition of the number of par- 
ticle per process keeping constant. The conditions for the benchmark simulations 
are as follows. Note that, all quantities are normarized by well depth e and atomic 
diameter a of the Lennards-Jones potential defined in Sec. 12.11 

• System size: 100 x 100 x 100 for each process. 

• Number of particles: 500,000 for each process (number density is 0.5). 

• Initial condition: Face-centered-cubic lattice. 

• Boundary condition: Periodic for all axes. 

• Integration scheme: Second-order symplectic integration. 

• Time step: 0.001. 

• Cutoff length: 2.5. 

• Search length: 2.8. 

• Cutoff scheme: Add constant and quadratic terms to potential as shown in 
Eq. 

• Initial velocity: The absolute value of the velocities of all particles are set to be 
0.9 and the directions of the velocities are given randomly. 

• After 150 steps, we measure the calculation (elapsed) time for the next 1000 
steps. 

When a system with N particles is simulated for k steps in t seconds, then the 
number of MUPS is given by xNk/lO^t. At NIFS, we placed 4x4x4 cubes at each 
node, and assigned one cube to each process (see Fig. [T3]) . since each node has 32 
cores, and we placed 64 processes on each node utilizing simultaneous multithreading 
(SMT). At ISSP, we placed 2x2x2 cubes at each node, and assigned one cube to 
each process, since one node consists of two CPUs (8 cores) at ISSP. 

The results obtained at NIFS and ISSP are summarized in Table Hll and Table Hill 
respectively. Both results are shown in Fig. [TH While the elapsed time of the 2-node 
calculation at NIFS is 302.6 s, that of the 128-node calculation is 412.7 s. Similarly, 
while the elapsed time of the single node calculation at ISSP is 183.13 s, that of the 
1024-node calculation is 272.478 s. Since we perform weak scaling, the elapsed time 
should be independent of the number of processes provided the computations are 
perfectly parallelized. Therefore, the increase in elapsed time of the calculation with 
larger nodes is regarded as being due to the parallel overhead. 
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1 Process 1 Node Total System 




0.5 Million Particles 64 Processes 8 Nodes 



Fig. 13. Domain decomposition for parallelization at NIFS. Each process is assigned a cube with 
a length of 100 that contains 500,000 particles. Each node is assigned to a large cube made up 
of 4 X 4 X 4 small cubes. The case of eight nodes is shown as an example. 



Nodes 


Processes 


Particles 


Elapsed Time [s] 


Speed [MUPS] 


Efficiency 


2 


128 


64,000,000 


302.623 


211.484 


1.00 


4 


256 


128,000,000 


305.469 


419.027 


0.99 


8 


512 


256,000,000 


309.688 


826.638 


0.98 


16 


1024 


512,000,000 


325.744 


1571.79 


0.93 


32 


2048 


1,024,000,000 


333.140 


3073.78 


0.91 


64 


4096 


2,048,000,000 


385.765 


5308.93 


0.78 


128 


8192 


4,096,000,000 


412.717 


9924.48 


0.73 



Table II. Results of benchmark simulations at NIFS. Numbers of nodes, numbers of processes, 
numbers of particles, elapsed times [s], and speeds [MUPS] are listed. Efficiency is estimated 
based on the elapsed time obtaind by the case of the 2-node calculation, since a single-node 
calculation is not permitted at NIFS. 



Nodes 


Processes 


Particles 


Elapsed Time [s] 


Speed [MUPS] Efficiency 


1 


2000000 


183.13 


10.9212 


1.00 


2 


8000000 


186.296 


42.9424 


0.98 


4 


16000000 


187.751 


85.2194 


0.98 


8 


32000000 


190.325 


168.133 


0.96 


16 


64000000 


194.932 


328.32 


0.94 


32 


128000000 


203.838 


627.95 


0.90 


64 


256000000 


224.028 


1142.71 


0.82 


128 


512000000 


228.485 


2240.84 


0.80 


1024 


4096000000 


272.478 


15032.4 


0.66 



Table III. Same as Table [III for ISSP. Efficiency is determined relative to the elapsed time for the 
single-node calculation. 

§5. Parallel Overhead 

The benchmark results described in the previous section show that the paral- 
lelization efficiency decreases as a number of processes increases, while the ammount 
of communication in one step seems to be negligible compared with the computa- 
tions. In the present section, we investigate the cause of the parallel overhead of our 
codes in order to identify the factor of the decrease in parallelization efficiency. All 
simulations presented here are performed under the same condition as that used for 
the benchmarks in Sec. 14.51 
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Fig. 14. Results of benchmark simulations. The 
ISSP by filled symbols. ET and Eff denotes 
increase in elapsed time is due to the parallel 



results at NIFS are denoted by open symbols and 
elapsed time [s] and efficiency, respectively. The 
overhead. 



5.1. Granularity 

In parallel computing, the ratio of computation to communication is important. 
This ratio is called the granularity. Fine-grain parallelism refers to the situation that 
data are transferred frequently and computational tasks are relatively light compared 
with communication. Coarse-grained parallelism, on the other hand, refers to the 
situation that computational tasks are relatively heavy compared with communica- 
tion. In general, a program having coarse granularity exhibits better performance 
for a large number of processes. Here, we estimate the granularity of our code. 
Although the interaction length is set to 2.5, each process has to send and receive 
information of particles whose distances from boundaries are less than 2.8, since 
we adopt the DUTC method and the search length is set to 2.8. Each process is 
assigned to a cube of length 100 and the number density is set to 0.5. The commu- 
nications are performed along the x-, y-, and z-directions, and two communications 
are involved for each direction. The amount of transmitting data increases in order 
of the X-, y-, and z- directions, since the received data of particles are forwarded. 
The position of each particle is expressed by three double-precision floating-point 
numbers, requiring 3 x 8 = 24 bytes. The amount of communication along y-axis is 
larger than that along x-axis, since the information of particles recieved along x-axis 
are transfered to y-axis (see Fig. [TTj) . The number of particles transmitted along 
the X-axis is 100 x 100 x 2.8 x 0.5 = 14, 000. Therefore, 336 KB are transmitted 
along the x-axis. Similarly, (100 + 2.8 x 2) x 100 x 2.8 x 0.5 x 24 = 355 KB of 
data are transmitted along the y-axis and 375 KB of data are transmitted along the 
z-axis. Since inner-node communication is performed on shared memory and should 
be sufficiently faster than internode communication, we only consider the internode 
communication. Each node at NIFS has 4 x 4 x 4 = 64 cubes, and therefore, 16 
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processes involve internode communication. The largest amount of communication 
is along the z-axis, which is about 375 KB xl6 = 6 MB. The bandwidth between 
each node at NIFS is about 5 GB/s. The total amount of internode communication 
is at most 36 MB, and therefore, the time spent on communication will be about 7.2 
ms for each step. The computational time without communication is about 0.3 s, 
which is 500 times longer than the estimated communication time. This implies that 
this code has coarse-grained parallelism and should be less affected by increasing the 
number of processes. The benchmark results, however, show a significant decrease 
in parallel efficiency for large-scale runs. 

We have also investigated the communication set up time at Kyushu University. 
The supercomputer at Kyushu University is HITACHI SR16000/J2 which is same 
as that at NIFS. But it is 42 nodes while NIFS has 128 nodes. We observed the 
communication set up time for MPI_Barrier and MPI_Allreduce over full of nodes, 
and found that the set up times are at most 50 us and 130 us, respectively. Therefore, 
communication cannot be a main factor of the parallel overhead. 



Nodes 


Total [s] 


Force [s] 


pair list [s] 


Communication [s] 


2 


304.609 


217.403 


38.26566 


12.54 


128 


404.335 


217.733 


41.85143 


14.39 



Table IV. Breakdowns of the computational time at NIFS. Total, Force, pair list, and Commu- 
nication denote the total time of computation [s], the time spent on force calculations [s], the 
time required to construct pair lists [s], and communication time [s], respectively. 



Nodes 


Total [s] 


Force [s] 


pair list [s] 


Communication [s] 


1 


180.945 


144.92 


14.388425 


4.12 


256 


228.614 


144.638 


16.299467 


7.92 



Table V. Same as Table HVl for ISSP. 



5.2. Lifetime of Pair Lists 

Since we perform weak scaling analysis, the size of the system increases as the 
number of nodes increases. This can decrease the lifetime of pair lists, since the max- 
imum speed of the particles increases in a larger system and faster particles reduce 
the lifetime of pair lists.123 The construction of a pair list takes much longer time 
than the force comutation takes, and therefore, the simulation run time is strongly 
affected if the number of pair lists being constructed increases. This increases the 
simulation run time even if the simulation is perfectly parallelized, and therefore, 
it is inappropriate to regard this cost as the parallelization cost. Therefore, we in- 
vestigate the system-size dependence on the time required for construction of pair 
lists. At NIFS, the pair list was reconstructed 20 times for the 2- node run while it 
was reconstructed 21 times for the 128-node run. The average life time is 51.2 steps 
for 2-node and 53.8 steps for 128-node caluclations. Therefore, the computational 
cost to construct pair lists increases for larger number of nodes, but the difference is 
small. The breakdowns of the computational time for the smallest and the largest 
runs are listed in Table ITVl and IVl and are also shown in Fig. [TJl The additional cost 
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Time [SEC] 




(a) NIFS 

2 nodes 



1 28 nodes 



(b) ISSP 

1 node 



256 nodes 



Fig. 15. Breakdowns of the computational time for (a) NIFS and (b) ISSP. The smallest and largest 
runs are shown for both cases. Although the communication time increases as the number of 
nodes increases, the increase in time for communication is at most 8% of the total increase, 
which cannot explain the observed parallel overhead. The idle time due to the synchronization 
is counted as "Others". 



of the pair list construction at NIFS is only 4 [s] which cannot explain the parallel 
overhead, which is over 100 [s]. Therefore, the lifetime of pair lists is not the main 
factor causing the reduced parallel performance. 

5.3. Communication Management 

Each simulation involves two types of communication at each step. One is point- 
to-point communication for sending and receiving data of particles, the other is 
collective communication for the synchronization of pair lists. The point-to-point 
communication is implemented in a simple way, that is, all processes send data in 
one direction and receive data from the opposite direction. All processes send and re- 
ceive data simultaneously in the same direction using MPLSendrecv (see Fig. 1161(a)). 
While the simple implementation of the communication may cause deadlock for the 
system with THE periodic boundary condition, the function MPLSendrecv avoid the 
deadlock implicitly by reordering of communication, which may cause some overhead 
in parallel efficiency for a large number of processes. Therefore, we examine different 
ways of communication which avoid the deadlock explicitly in order to investigate 
whether the communication management by MPLSendrecv is the factor of the par- 
allel overhead. Here we performed three methods. The first method, referred to as 
'Oneway', is the simple implementation described in Sec. 14.31 The second method, re- 
ferred to as 'Two-step Oneway', involves communications in 2 stages (see Fig. 1161(b)). 
All processes are classified into two groups, even and odd. In the first of the two 
stages, even processes send data and odd processes receive data. In the second stage. 
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(a) Oneway (b) Two-step Oneway (c) Painwise 




Fig. 16. Communication methods, (a) Oneway: All processes send information in the same direc- 
tion, and receive the data simultaneously using MPl_Sendrecv. (b) Two-step Oneway: Although 
all processes send information in the same direction, the sending and receiving processes are 
explicitly separated using MPI_Send and MPI_Recv. (c) Pairwise: Each process is classified as 
even or odd. When the even processes exchange the positions of particles with their left neigh- 
bors, the odd processes exchange the positions of particles with their right neighbors. Then the 
directions of communication are reversed. Communication is implemented by MPI_Sendrecv. 



data are transmitted in the reverse direction. The communications are implemented 
by the blocking functions MPLSend and MPLRecv. The third method, referred to 
as 'Pairwise', arranges the order of communication explicitly. All processes are clas- 
sified into two groups, even and odd, similarly to in the previous method. When the 
even processes exchange data with their right neighbors, the other processes exchange 
data with their left neighbors, and vice versa (see Fig. [16] (c)). The communications 
are implemented by MPI_Sendrecv. Note that, the behavior of 'Pairwise' at ISSP is 
not monotonic, i.e., the data of 128-node run is much faster than those of 8-, 16-, 
32-, and 64-node runs. This may be due to the network topology at ISSP which is 
hypercube, however, the details should be investigated later. 

A comparison of benchmark test for above the three methods is shown in Fig. [T71 
Figure [TTl (a) shows the results at NIFS. The results of Pairwise only slightly differ 
from those of Oneway. While the results of Two-step Oneway show similar per- 
formance up to 32 nodes, those for larger nodes are significantly slower than the 
other methods. Figure [T71 (b) shows the results at ISSP. There are major differences 
between the performances of the three methods even though the amounts of com- 
munication are identical, and Two-step Oneway considerably slower than the other 
methods. The reason why Two-step Oneway is the slowest method in both cases 
may be that double threads are executed by separating the sending and receiving 
processes instead of using MPLSendrecv. While it is unclear why the results of the 
three methods are almost the same for the largest number of nodes at ISSP, we can 
conclude that the overhead of communications strongly depends on the hardware 
environment. Considering the above results at NIFS and ISSP, the simple Oneway 
scheme is found to be less affected by the architecture of parallel machine and con- 
sistently has relatively good performance. 

5.4. Synchronicity Problem 

Parallel programs adopting the single program multiple data (SPMD) model 
usually require synchronicity, and the parallel MD described here is one such exam- 
ple. The slowest process determines the overall performance, since other processes 
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Fig. 17. Comparison between three communication schemes at (a) NIFS and (b) ISSP. The Oneway 
scheme exhibits the best performance at both facihties. 



must wait for the slowest process owing to the synchronization. Therefore, the par- 
ahel efficiency is strongly affected by the fluctuation of the computation time of each 
process. One of the factors causing the fluctuation is load imbalance. Load imbal- 
ance refers to the inhomogeneity of workload among processes. Unequally assigned 
tasks cause the fluctuation of the computational time of each step and they decrease 
the parallel efficiency. The workload of parallel MD mainly depends on the num- 
ber of interacting pairs in each process. To clarify this problem quantitatively, the 
number of interacting pairs is counted at the flnal state of the 128-node run. The 
parallel efficiency is determined by the slowest process which should be with the 
largest number of interacting pairs. 

In our calculation, we found that the largest and the smallest number of pairs 
were 11,656,318 and 11,629,010, respectively. The difference is about 0.02%, which 
cannot be responsible for the reduction in parallel efficiency. Note that this small 
fluctuation originates from the fact that the initial condition is homogeneous. The 
load imbalance can be serious for inhomogeneous systems such as systems involving 
phase transitions. However, our benchmark results show that the parallel efficiency 
decreases as the number of processes increases, even for a homogeneous system with 
a uniform workload for each process. Therefore, we assumed that there are other 
factors that affect the parallel efficiency. 

Next, we investigate the fluctuation of the computational time of force for each 
process. Suppose is the computational time spent for force calculation of p-th 
process at z-th step. The average of the force calculation at i-th step is defined as 

P p 

(i) 

with the number of processes Np. The time evolutions of tave at NIFS and ISSP are 
shown in Fig. [T51 The data of the smallest and the largest runs are presented. Note 
that, we observe the time spent only for force calculation which dose not involve 
communication. We found that the computational time for force show almost same 
values and independent of the number of nodes. However, the compuational time 
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for each step itself might fluctuate, which cause the one of candidate of reduction of 
parallel efficiency. Therefore, we focus on the calculation time of force for each step. 
To investigate the fluctuation of the computational time for each step, we define the 
relative difference as 

t^^ -t^'^ 

fW _ '■max ''min /c ^A 

diff (i) ' V 

t-ave 

where tmax = ™3'Xp{t|^''^} is the maximum time, and t^^^ = minp{t|^''^} is the min- 
imum time spent on the force calculation among all processes at ith step^ ^ff 
denotes the difference between the times spent on the slowest and fastest processes 
normalized by the average time at ith step. The time evolutions of the average and 
relative difference for force calculations at NIFS are shown in Fig. [19] (a). While the 
average times for force calculations are identical for 2- and 128-node runs, the rela- 
tive differences of the large run are significantly larger than those of the small run. 
The relative difference of the large run sometimes fluctuates by 100% (t^jg becomes 
1.0), which implies that the computational time is doubled at the step. The same 
plots for ISSP are shown in Fig. [TH] (b). Similar phenomena are observed, although 
the relative differences of the large run are at most 40%. The relative differences of 
the small run at ISSP are almost zero, since the run has only four processes, which 
is the reason why the fluctuation is relatively small. 

From the above, we can conclude that the main factor degrading the parallel 
efficiency is the fluctuation of the computational time at each step. Although the 
average time varies only slightly, the fluctuation of the computational time between 
processes increases as the number of processes increases. Then the delays of the 
slowest process accumulate as the overhead. There are several possible factors caus- 
ing this fluctuation. One possible source is noise from an operating system (OS). An 
OS has several background processes called daemons, which interfere with the user 
execution time and decreases its efficiency This noise from OS is referred to as OS 
jitter or OS detour. Interruptions by an operating system occur randomly, and they 
cause the fluctuation of the execution time of each process. Not only interruptions 
themselves, but also side effects caused by the interruptions, such as pollution of the 
CPU cache, are also a source of noise. 



§6. Summary and Further Issues 



We have developed an MD code that is not only suitable for massively parallel 
processors but also exhibits good single-processor performance for IBM Power and 
Intel Xeon architectures. Most of all algorithms presented in this paper can be 
applied to other particle systems with short-range interactions including many-body 
interactions. There are three levels of optimization. At the highest level, we attempt 
to reduce the computational cost by developing or choosing appropriate algorithms 
such as the bookkeeping method. Next, we have to manage memory access. All data 
required by cores should be in the cache so that the cores make the best use of its 
computational power. Finally, we have to perform architecture-dependent tuning. 
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Fig. 18. Computational CPU time for force calculations averaged over all processes at each step, 
(a) The data obtained at NIFS. The data obtained from the smallest and largest runs are 
denoted by '2' and '128', respectively, (b) The data obtained at ISSP. The data obtained from 
the smallest and largest runs are denoted by '1' and '256', respectively. Time evolutions are 
almost independent of the number of nodes. 
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since the development paradigms are considerably different between different CPUs. 

MD is suitable for MPP-type simulations since the computation-communication 
ratio is generally small. However, we have observed a significant degradation in 
parallel efficiency despite the use of conditions under which the communication cost 
should be negligible. We have found that the fluctuation of the execution time 
of processes reduces the parallel efficiency. The fluctuations become serious when 
the number of process becomes over one thousand. The OS jitter, which is the 
interference by the operating system, can be a source of the fluctuation. There 
are many attempts to handle the OS jitter, but it is difficult to be settled by user 
programs. One possibility is to adopt hybrid parallelization, i.e., use OpenMP for 
inner-node communications and use MPI for internodc communications. Hybrid 
parallelization not only reduces the number of total MPI processes, but also averages 
out the fluctuations which can improve the parallel efficiency. Note that the affect 
due to the OS jitter is generally order of nanoseconds or shorter, and the time 
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required to perform one step of MD simulations is order of millisecond. Therefore, 
there may be other sources of noise, such as translation lookaside buffer (TLB) miss. 
TLB is a kind of cache, which translate the address from virtual to reahl^D While it 
makes computation faster, it takes long time to fetch the data if the required address 
is not cached in the buffer. The translation tables are somtimes swapped out into 
other storage such as hard-disk, then the penalty of TLB miss can be expensive. 
Investigation of the source of the fluctuation in execution time is one of the further 
issues. 

In this paper, the load imbalance problem is not considered. Load balancing 
is important for parallelization with a huge number of processes since the fluctua- 
tion of the timing of each process greatly degrades the parallel efficiency as shown 
in Sec. 15.41 There are two approaches to load balancing, one is based on force de- 
composition and the other is based on domain decomposition. The load balancing 
in NAMD (Not Another Molecular Dynamics program is essentially achieved by 
the force decomposition, while NAMD adopts a hybrid strategy of parallelization. 
The processes of NAMD are under the control of a dynamic load balancer. The 
simulation box is divided into subdomains (called patches), and the total compu- 
tational task is divided into subtasks (called compute objects). The load balancer 
measures the load on each process and reassigns compute objects to processors if 
necessary. GROMACS (Groningen Machine for Chemical Simulations )P adopts a 
domain-decomposition-based strategy for load balancing, that is, it changes the vol- 
ume of domains assigned to processors to improve the load balance. The simulation 
box is divided into staggered grids whose volume are different. A load balancer 
changes the volume of each grid by moving the boundaries of cells to reduce load 
imbalance. There is also a method of non-box-type space decomposition that uti- 
lizes the Voronoi construction. Reference points are placed in a simulation box and 
each Voronoi cell is assigned to a process. Load balancing is performed by changing 
the positions of the reference points. Generally speaking, force-decomposition- 
based load balancing is a better choice if a system contains long-range interactions 
such as electrostatic force, and domain-decomposition-based load balancing is better 
for a system with short-range interactions. However, the choice strongly depends on 
the phenomenon to be simulated, and therefore, both strategies or some hybrid of 
them should be considered. 

The source codes used in this paper has been published online.ISS We hope 
that the present paper and our source codes will help researchers to develop their 
own parallel MD codes that can utilize the computational power of petaflop, and 
eventually exaflop machines. 
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